RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

Siyong Jian1*, Siyuan Li1,2*, Luyuan Zhang3*, Zedong Wang4, Xin Jin1, Ying Li1, Cheng Tan5†, Huan Wang1†
1Westlake University     2Zhejiang University     3Tsinghua University
4HKUST     5Shanghai AI Lab
*Equal contribution  ·  †Corresponding author
Comparison between existing AR post-training (frozen decoder) and the RankE framework (co-evolved decoder)
Existing AR post-training freezes the VQ decoder and optimizes only the AR policy with T2I rewards — inducing Latent Covariate Shift. RankE co-evolves both modules with complementary T2I and pixel rewards, closing the shift and breaking the fidelity–alignment trade-off.

Abstract


Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. We show that this policy-only optimization induces Latent Covariate Shift: as the policy evolves under reward pressure, the token distribution diverges from the ground-truth distribution on which the decoder was trained, so reward scores improve while decoded image quality degrades.

To address this mismatch, we propose RankE — the first end-to-end post-training framework for discrete T2I generation. Rather than optimizing the policy against a fixed decoder, RankE co-evolves both components through alternating optimization: each module maximizes a ranking-based alignment objective while being regularized by a stability-preserving anchor suited to its parameter space. On LlamaGen-XL (775M), standard RL improves CLIP but degrades FID; RankE improves both simultaneously, reaching FID 15.21 and CLIP 33.76 on MS-COCO 30K, and consistent gains hold on Janus-Pro-1B and under the HPSv2 reward.

The Problem: Latent Covariate Shift


During tokenizer pre-training, the VQ decoder is trained exclusively on deterministic ground-truth codes $z_{\mathrm{gt}} = \mathrm{Quantize}(E(x))$, which occupy a restricted, low-variance region of the latent space. At inference, the same decoder receives tokens sampled from the AR policy, $\hat{z} \sim \pi_\theta(\cdot \mid y)$, whose distribution progressively diverges from this regime as the policy evolves under reward pressure.

This divergence produces a fidelity–alignment trade-off that policy-side tuning alone cannot resolve. GRPO applied to LlamaGen-XL improves CLIP yet degrades FID across checkpoints, and the KL divergence against ground-truth token statistics confirms that standard RL substantially widens the distributional gap relative to SFT.

Latent Covariate Shift quantification and CLIP-FID Pareto front
Left: KL divergence between token distribution and the ground-truth VQ token distribution. Standard RL increases the divergence over SFT, whereas RankE keeps it near the SFT level via decoder co-evolution. Right: Standard RL yields small CLIP improvements with non-improving FID; RankE attains a Pareto improvement, increasing CLIP while decreasing FID.
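One way to make the left-panel measurement concrete is to compare codebook-usage histograms. The sketch below is an assumption about how such a gap could be measured, not the authors' protocol: it treats token statistics as unigram frequencies over the codebook, uses additive smoothing, and computes KL(policy || ground truth). The function names, the KL direction, the smoothing constant, and the toy token lists are all illustrative.

```python
import math
from collections import Counter

def token_histogram(token_ids, codebook_size, eps=1e-6):
    """Smoothed unigram distribution over codebook indices (sums to 1)."""
    counts = Counter(token_ids)
    total = len(token_ids) + eps * codebook_size
    return [(counts.get(k, 0) + eps) / total for k in range(codebook_size)]

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy data for a hypothetical 4-entry codebook.
gt_tokens = [0, 1, 2, 3, 0, 1, 2, 3]    # stands in for codes from Quantize(E(x))
sft_tokens = [0, 1, 2, 3, 0, 1, 2, 2]   # mild drift from the ground-truth usage
rl_tokens = [0, 0, 0, 0, 0, 0, 1, 2]    # usage concentrated on a few codes

p_gt = token_histogram(gt_tokens, 4)
kl_sft = kl_divergence(token_histogram(sft_tokens, 4), p_gt)
kl_rl = kl_divergence(token_histogram(rl_tokens, 4), p_gt)
```

In this toy setup the concentrated token stream has the larger divergence, which is the qualitative pattern the figure reports for standard RL relative to SFT.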

Unlike exposure bias, which concerns the input context of the generator, Latent Covariate Shift concerns the input distribution of the decoder, so it cannot be corrected by tuning the policy alone.

The RankE Framework


RankE jointly evolves the policy and the decoder without differentiating through the discrete bottleneck. The name reflects two ranking-based mechanisms at complementary granularities: a token-level ranking objective drives the policy update, and a pixel-level ranking objective drives the decoder update. We cast this as a Generalized EM procedure with an alternating schedule.

RankE two-stage co-evolution framework
Stage 1 — Latent Ranking (Policy): with the decoder fixed, the AR generator is updated via group-relative preference optimization with EMA-KL regularization. Stage 2 — Pixel Ranking (Decoder): with the policy fixed, the VQ decoder is adapted on policy-sampled latents through a composite objective: Rank-GAN (reward-weighted adversarial loss) + reward back-propagation + ground-truth reconstruction + EMA consistency distillation.

A unified two-stage objective

Although the updates live on incompatible parameter spaces — discrete tokens for the policy $\pi_\theta$ and continuous pixels for the decoder $D_\phi$ — they share a common structure:

$$ \max_{\Psi \in \{\theta,\, \phi\}} \;\; \mathcal{J}(\Psi) \;=\; \underbrace{\mathbb{E}\!\left[\,\mathcal{A}_{\Psi}\,\right]}_{\text{ranking-based alignment}} \;-\; \underbrace{\lambda\,\Omega(\Psi)}_{\text{stability-preserving regularization}} $$

Crucially, $\mathcal{A}_\Psi$ is implemented through relative ranking rather than absolute reward magnitude: at every step we draw $G$ rollouts from the same prompt, score them with $r$, and update $\Psi$ toward higher-ranked samples. The two stages run alternately within each round and across $K$ rounds — reward information crosses the discrete bottleneck through the alternation rather than through any single gradient path.
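The alternation described above can be sketched as a plain loop. This is a minimal illustration under stated assumptions: `policy_step` and `decoder_step` are hypothetical callables standing in for one Stage 1 and one Stage 2 update, and the per-stage iteration counts are placeholders, since the text only states that the stages alternate within each of the $K$ rounds.

```python
def co_evolve(policy_step, decoder_step, num_rounds, policy_iters, decoder_iters):
    """Alternate policy and decoder updates for num_rounds rounds.

    Returns a log of (round_index, stage_name) so the schedule can be inspected.
    """
    log = []
    for k in range(num_rounds):
        for _ in range(policy_iters):     # Stage 1: decoder held fixed
            policy_step()
            log.append((k, "policy"))
        for _ in range(decoder_iters):    # Stage 2: policy held fixed
            decoder_step()
            log.append((k, "decoder"))
    return log

# Usage with no-op steps: 2 rounds, 1 policy update and 2 decoder updates per round.
schedule = co_evolve(lambda: None, lambda: None,
                     num_rounds=2, policy_iters=1, decoder_iters=2)
```

Neither step differentiates through the token sampling; each stage only consumes what the other produced in the previous stage.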

Stage 1 — Token-level ranking via GRPO

With the decoder fixed, the policy is updated using Group Relative Policy Optimization. For each prompt, $G$ rollouts are drawn, decoded, scored under $r$, and converted into group-normalized advantages $A_i = (r_i - \mu_r) / \sigma_r$. The clipped-advantage loss is the token-level ranking term $\mathcal{A}_\theta$, and the KL against an EMA reference serves as the stability anchor $\Omega_\theta$.
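The advantage formula and clipped loss above can be sketched in a few lines. The assumptions here are the population standard deviation, the small `eps` added to the denominator, and the clip range of 0.2 (a common PPO-style default); none of these values is given in the text, and the function names are illustrative.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages A_i = (r_i - mu_r) / sigma_r for one prompt."""
    mu = sum(rewards) / len(rewards)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip=0.2):
    """PPO-style clipped objective for one token (to be maximized).

    ratio is pi_theta(token) / pi_old(token); clip is an assumed hyperparameter.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)

# Four rollouts of one prompt with hypothetical reward scores.
advantages = group_advantages([0.30, 0.32, 0.34, 0.36])
```

Because the advantages depend only on each rollout's position relative to its group, shifting or rescaling all rewards for a prompt leaves them essentially unchanged, which is what makes the objective ranking-based rather than magnitude-based.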

Stage 2 — Pixel-level ranking via Rank-GAN

With the policy fixed, the decoder is allowed to track the policy's evolving token distribution. The same $G$ rollouts that were just ranked in token space are now re-ranked in pixel space. Mirroring Stage 1, the decoder loss decomposes into an alignment block (Rank-GAN + differentiable reward) and a manifold-anchor block (ground-truth reconstruction + EMA-consistency distillation):

$$ \mathcal{L}_{D}(\phi) = \underbrace{\lambda_d \mathcal{L}_{\mathrm{reward}} + \lambda_g \mathcal{L}_{\mathrm{Rank\text{-}GAN}}}_{\mathcal{A}_\phi: \text{ pixel-level ranking}} \;+\; \underbrace{\lambda_r \mathcal{L}_{\mathrm{recon}} + \lambda_c \mathcal{L}_{\mathrm{consist}}}_{\Omega_\phi: \text{ manifold anchor}} $$
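The grouping in this equation can be sketched as a weighted sum of four scalar terms, together with the EMA copy that a consistency term would compare against. This is a hedged illustration: the dictionary keys, the weight values, and the EMA decay are placeholders chosen for the example, and the exact form of each individual loss is not specified here.

```python
def decoder_loss(terms, weights):
    """Combine the four decoder loss terms into alignment and anchor blocks.

    terms:   scalar losses keyed "reward", "rank_gan", "recon", "consist"
    weights: lambda coefficients keyed "d", "g", "r", "c" (placeholder values)
    """
    alignment = weights["d"] * terms["reward"] + weights["g"] * terms["rank_gan"]
    anchor = weights["r"] * terms["recon"] + weights["c"] * terms["consist"]
    return alignment + anchor, alignment, anchor

def ema_update(ema_params, params, decay=0.999):
    """Exponential moving average of decoder parameters (assumed teacher update)."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Hypothetical values, chosen only to make the arithmetic easy to follow.
total, alignment, anchor = decoder_loss(
    terms={"reward": 0.5, "rank_gan": 1.0, "recon": 0.25, "consist": 0.5},
    weights={"d": 1.0, "g": 0.5, "r": 2.0, "c": 1.0},
)
```

Returning the two blocks separately makes it easy to monitor whether the alignment terms are pulling the decoder away from the anchor terms during training.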

Rank-GAN is the key innovation in the decoder update: it preserves the expected gradient magnitude of a vanilla GAN while concentrating updates on policy-preferred samples via reward-derived weights $w(\hat{z}_i) \propto \exp(r_i / \tau)$. Replacing Rank-GAN with a uniform GAN degrades both CLIP and FID, confirming that reward weighting is the active ingredient.
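The reward-derived weights can be sketched as a temperature-scaled softmax over the group. Rescaling the weights so they average to 1 is an assumption made here as one plausible reading of "preserves the expected gradient magnitude"; the function names and the temperature value are illustrative.

```python
import math

def rank_gan_weights(rewards, tau=1.0):
    """Weights w_i proportional to exp(r_i / tau), rescaled to mean 1 over the group."""
    r_max = max(rewards)                      # subtract the max for numerical stability
    exps = [math.exp((r - r_max) / tau) for r in rewards]
    z = sum(exps)
    return [len(rewards) * e / z for e in exps]

def weighted_adversarial_loss(per_sample_losses, weights):
    """Reward-weighted mean of per-sample generator losses."""
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / len(weights)

# Three rollouts with hypothetical rewards; a lower tau sharpens the ranking.
weights = rank_gan_weights([0.1, 0.2, 0.3], tau=0.1)
uniform = rank_gan_weights([0.2, 0.2, 0.2], tau=0.1)
```

With equal rewards every weight is 1, so the weighted loss reduces to the ordinary mean used by an unweighted GAN; with unequal rewards the higher-ranked rollouts contribute more.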

Results


Quantitative — CLIP-based optimization

Standard RL improves CLIP but degrades FID; RankE co-evolves the decoder, achieving both higher CLIP and lower FID simultaneously. Arrows with deltas mark the gain over Std. RL.

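| Backbone | Method (Reward) | Decoder | CLIP ↑ | FID ↓ | GenEval ↑ |
|---|---|---|---|---|---|
| LlamaGen-XL (775M) | Base | Frozen | 31.54 | 15.24 | 0.309 |
| LlamaGen-XL (775M) | SFT | Frozen | 31.86 | 16.58 | 0.374 |
| LlamaGen-XL (775M) | Std. RL (CLIP) | Frozen | 32.45 | 17.76 | 0.417 |
| LlamaGen-XL (775M) | RankE (CLIP) | Co-Evol | 33.76 (↑1.31) | 15.21 (↓2.55) | 0.425 (↑0.008) |
| Janus-Pro (1B) | Base | Frozen | 33.20 | 18.95 | 0.740 |
| Janus-Pro (1B) | SFT | Frozen | 33.31 | 26.73 | 0.739 |
| Janus-Pro (1B) | Std. RL (CLIP) | Frozen | 33.60 | 25.59 | 0.746 |
| Janus-Pro (1B) | RankE (CLIP) | Co-Evol | 33.86 (↑0.26) | 25.19 (↓0.40) | 0.750 (↑0.004) |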

Quantitative — HPSv2-based optimization

Gains generalize to non-differentiable rewards: under HPSv2, RankE improves preference alignment in pixel space while preserving compositional generalization on GenEval.

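| Method | Decoder | HPSv2 Photo | HPSv2 Concept | HPSv2 Anime | HPSv2 Avg ↑ | GenEval ↑ |
|---|---|---|---|---|---|---|
| Base | Frozen | 0.2364 | 0.2107 | 0.2183 | 0.2196 | 0.309 |
| SFT | Frozen | 0.2281 | 0.2202 | 0.2222 | 0.2221 | 0.374 |
| Std. RL (HPSv2) | Frozen | 0.2466 | 0.2435 | 0.2436 | 0.2451 | 0.418 |
| RankE (HPSv2) | Co-Evol | 0.2492 | 0.2479 | 0.2453 | 0.2531 | 0.423 |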

Training dynamics

RankE monotonically improves both metrics while standard RL degrades FID as training progresses.

Training dynamics on LlamaGen-XL — CLIP, FID, HPSv2
Training dynamics on LlamaGen-XL. (a) CLIP score, (b) FID, (c) HPSv2 score across training steps. RankE consistently outperforms standard GRPO and, unlike GRPO, monotonically improves FID throughout training.

Qualitative Comparisons


Across matched prompts, the base model frequently misses prompt attributes (color, count, spatial relations); standard RL improves adherence at the cost of visible artifacts — a direct consequence of the frozen decoder processing latents drawn from a distribution it was never trained on. RankE produces images with both faithful attributes and high perceptual quality.

Qualitative comparison: Base vs RL (GRPO) vs RankE
Same prompt, three models. RankE yields precise attributes and fine-grained details corresponding to the text prompts. Standard RL (GRPO) often introduces visual artifacts, while RankE significantly reduces them, indicating that adapting the decoder to policy-sampled latents is what turns latent-space alignment into pixel-space fidelity.

Citation


@article{Jian2026RankE,
  title={RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution},
  author={Jian, Siyong and Li, Siyuan and Zhang, Luyuan and Wang, Zedong and Jin, Xin and Li, Ying and Tan, Cheng and Wang, Huan},
  journal={arXiv preprint arXiv:2605.21195},
  year={2026}
}