Fast-Slow Training: Letting an LLM's "Parameters" and "Context" Learn Together

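Key Takeaways

The dominant narrative in LLM post-training goes like this: first update the parameters with SFT or RL, then do prompt optimization. This pipeline treats "parameter learning" and "prompt tuning" as two independent phases. The paper argues that this is wrong: the two should be optimized together.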

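The core idea of Fast-Slow Training (FST) is simple:

Slow weights: the usual model parameters $\theta$, updated with RL (specifically GRPO with a CISPO loss). They change little per step and persist across tasks.
Fast weights: the textual context $\Phi$ (a Pareto-frontier prompt pool), updated with GEPA (a reflective prompt optimizer). They pick up task signal quickly from the text feedback in rollouts (thoughts, tool calls, errors).

The two are updated in alternation and co-evolve. RL sees a context that the fast weights have already optimized; GEPA sees a policy that RL keeps improving.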

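Experimental results (5 RLVR tasks, 4 different base models):

Data efficiency: FST reaches the same reward as RL with up to 3× fewer rollouts, and its final ceiling is higher
Less parameter drift: at equal reward, KL-to-base is 70% lower than with RL-only, which suggests substantially less catastrophic forgetting
Preserved plasticity: when training continues on task B after task A, the RL-trained model collapses to nearly 0% on task B, while the FST-trained model keeps adapting
Continual learning: when tasks switch during training, FST keeps acquiring new tasks while RL-only stalls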

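My verdict after reading: this is the methods paper most worth watching in RL post-training in 2026. It is not another incremental gain of a few points; it offers a new paradigm that makes RL training cheaper, more stable, and capable of continual learning.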

Paper Details

Title: Learning, Fast and Slow: Towards LLMs That Adapt Continually
Authors: Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal, Joseph E. Gonzalez, Matei Zaharia, Kurt Keutzer, Inderjit S Dhillon, Rishabh Agarwal, Devvrit Khatri
Affiliations: UC Berkeley, Mila, UT Austin (a heavyweight collaboration; Joseph Gonzalez and Matei Zaharia are both on it)
arXiv: https://arxiv.org/abs/2605.12484

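Why Fast-Slow?

The paper lays out two fundamental limitations of LLM post-training very clearly:

Problem 1: every adaptation is forced into the parameters

When you train an LLM with RL to solve math problems, every improvement, whether a reusable reasoning skill, a task-specific heuristic, or just a one-off lesson from a single rollout, gets compressed into the same channel: the parameters $\theta$.

Consequences:

Drift away from the base model: parameter drift degrades OOD generalization
Entropy collapse: the policy gets sharper and sharper and loses its ability to explore
Plasticity loss: the parameters get "used up", so later tasks become hard to learn

This is an architectural problem: a single channel carrying all adaptation is bound to overload.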

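Problem 2: the prompt channel is wasted

LLM systems have another powerful adaptation channel: prompts, instructions, and context. These:

Are very cheap to modify (no backprop)
Can be designed per task
Take effect immediately, with no waiting on gradients

Yet current engineering practice treats them as "train the model first, tweak the prompt at the end", so the adaptive capacity of the prompt is underused.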

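Insight: let both channels carry adaptation

The human brain learns on different timescales (System 1 vs. System 2, hippocampus vs. cortex). Neural networks also have a long tradition of fast weights and slow weights (going back to Hinton in 1987).

FST's proposal: treat the LLM's prompt as fast weights and its parameters as slow weights, and let them co-evolve.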

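Figure 1: Overview of the FST framework

Figure 1: FST's two interleaved loops. Slow loop (top): update the parameters $\theta_c \to \theta_{c+1}$ using the scalar reward. Fast loop (bottom): update the prompt pool $\Phi_c \to \Phi_{c+1}$ via reflective optimization, taking the full rollout text as input (thoughts, tool calls, errors, rich feedback). Keeping a Pareto-frontier population (rather than a single best prompt) lets different prompts specialize in different slices of the problem set and gives the slow update richer conditioning.

Note one key design choice: the fast channel receives text feedback, not just a scalar reward. GEPA can read the error messages and tool errors in a rollout. A scalar reward throws this information away entirely; prompt optimization puts it to use.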

Method: Interleaved RL + GEPA Training

Mathematical formulation

The model is written $\pi_\theta(y | x, \phi)$: the response is conditioned jointly on the parameters $\theta$, the textual context $\phi$, and the query $x$. The joint objective is:

\[\max_{\theta, \phi} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot | x, \phi)} [r(x, y)]\]

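Note that $\theta$ and $\phi$ are optimized jointly, not sequentially. Here $\phi$ is a single textual context; as I read it, it is drawn from the prompt pool $\Phi$ introduced above.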
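One way to read "joint, not sequential" is as an alternating scheme over training cycles $c$. The following is a sketch under my own assumptions, not the paper's stated update rule: $\mathrm{RL}(\cdot)$ abbreviates one block of policy-gradient steps with prompts drawn from the current pool, $\mathrm{GEPA}(\cdot)$ abbreviates one reflective update of the pool, and the ordering of the two within a cycle is a guess.

\[\theta_{c+1} = \mathrm{RL}\big(\theta_c;\, \Phi_c\big), \qquad \Phi_{c+1} = \mathrm{GEPA}\big(\Phi_c;\, \pi_{\theta_{c+1}}\big)\]

Each slow update conditions on the current pool $\Phi_c$, and each fast update reflects on rollouts from the freshly updated policy, so neither channel is tuned against a frozen copy of the other.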

Slow loop: GRPO + CISPO

The RL algorithm is the ScaleRL recipe with GRPO and a CISPO loss. For each query $x$, sample $G$ rollouts and compute the group-relative advantage:

\[A_i = \frac{r(x, y_i) - \bar{r}_g}{\sigma_g + \varepsilon}\]

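Here $\bar{r}_g$ and $\sigma_g$ are the mean and standard deviation of the rewards within the group, and $\varepsilon$ is a small constant for numerical stability. The parameter update uses truncated importance-sampling REINFORCE. This part is standard RLVR.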
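To make the advantage formula concrete, here is a minimal Python sketch. It assumes binary rewards, the population standard deviation for $\sigma_g$ (the post does not say which variant is used), and a hypothetical cap of 2.0 on the importance weight; none of these values come from the paper.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """A_i = (r_i - mean) / (std + eps), computed over one query's group of rollouts."""
    r_bar = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - r_bar) / (sigma + eps) for r in rewards]


def truncated_is_weight(logp_new, logp_old, cap=2.0):
    """Importance ratio pi_new / pi_old for one token, truncated from above.

    `cap` is a hypothetical value chosen for illustration only.
    """
    return min(math.exp(logp_new - logp_old), cap)


# One query, G = 8 rollouts, reward 1.0 for a correct answer and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Correct rollouts get a positive advantage and incorrect ones a negative advantage, and the advantages sum to zero within the group. If every rollout in a group gets the same reward, all advantages are zero and that query contributes no gradient.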

Fast loop: GEPA reflective optimization

GEPA (Genetic-Pareto reflective prompt optimization) is the key component of this paper. In short:

Maintain a Pareto-frontier population of $K$ prompts
Trigger a fast update every $T$ RL steps
Have a reflection LM (the paper uses GPT-5.2) read the rollout text, diagnose why rollouts failed, and propose an improved prompt
Evaluate the new prompt and update the Pareto frontier
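The Pareto-frontier bookkeeping in the last step can be sketched as a plain non-dominated filter over per-problem scores. This is a simplification of my own, not GEPA's exact selection rule; the prompt names and scores are made up for illustration.

```python
def dominates(a, b):
    """True if score vector a is >= b on every problem and > b on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(scores):
    """Return the prompts whose per-problem score vector is not dominated.

    scores: dict mapping prompt name -> list of per-problem scores.
    """
    return [
        p for p, s in scores.items()
        if not any(dominates(t, s) for q, t in scores.items() if q != p)
    ]


# Hypothetical scores on three problems.
scores = {
    "prompt_a": [1.0, 0.0, 1.0],  # best on problem 0
    "prompt_b": [0.0, 1.0, 1.0],  # best on problem 1
    "prompt_c": [0.0, 0.0, 1.0],  # dominated by both a and b
}
frontier = pareto_frontier(scores)  # ["prompt_a", "prompt_b"]
```

Neither `prompt_a` nor `prompt_b` beats the other everywhere, so both stay; that is the mechanism by which the pool keeps prompts that specialize in different slices of the problem set.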

Why a Pareto frontier rather than a single best prompt? Diversity: different prompts do well on different slices of the problem set, which gives the slow update a richer conditioning distribution.

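Interleaving schedule

Run one GEPA cycle after every $T=6$ RL steps
Each GEPA cycle outputs a Pareto frontier of $K$ prompts
In the next block of RL steps, each problem's $G=8$ rollouts are split across the $K$ prompts ($G/K$ rollouts per prompt)
This way every batch RL sees naturally contains prompt diversity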
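The schedule above can be sketched in a few lines. The helper names are hypothetical, and I assume $G$ is divisible by $K$ and that RL steps are counted from 1; the post does not specify either detail.

```python
def assign_prompts(prompts, group_size=8):
    """Split one problem's G rollouts evenly across the K prompts in the pool."""
    k = len(prompts)
    if k == 0 or group_size % k != 0:
        raise ValueError("group_size must be a positive multiple of the pool size")
    per_prompt = group_size // k
    return [p for p in prompts for _ in range(per_prompt)]


def is_fast_update_step(step, cycle_length=6):
    """True if a GEPA cycle should run after this (1-indexed) RL step."""
    return step % cycle_length == 0


# K = 2 prompts, G = 8 rollouts: each prompt conditions 4 rollouts of the same problem.
plan = assign_prompts(["prompt_a", "prompt_b"], group_size=8)

# With T = 6, GEPA cycles follow RL steps 6, 12, 18, ...
gepa_steps = [s for s in range(1, 19) if is_fast_update_step(s)]
```

Because all $G$ rollouts of a problem still form one group, the advantage baseline can be computed per problem across the mixed prompts, which is the variant the ablations later in this post favor.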

There is also a nice engineering trick, rollout reuse: rollouts generated during GEPA evaluation are cached and handed to RL, which makes FST's per-step wall-clock time slightly lower than RL-only's ($\sim$47s vs. $\sim$60s).

Experiments: 5 Domains × 5 Advantages

The experiments are thorough: 5 task domains (HoVer-hard, Physics, CodeIO, Math/Polaris, Star-graph), 3-4 base models (Qwen3-8B think/instruct, Qwen3-4B-Instruct), and an FST vs. RL-only comparison on every task.

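Advantage 1: data efficiency

Figure 2: Training curves of FST vs. RL-only across tasks

Figure 2: Training curves for FST (blue) vs. RL-only (orange) on HoVer-hard, Physics, CodeIO, Math, and Star-graph. On every task FST's reward rises faster and reaches a higher ceiling. On some tasks it matches RL's reward with 3× fewer rollouts.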

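Advantage 2: a higher performance ceiling

FST is not just faster; its final reward is also higher than RL-only's. That is a little counterintuitive: you might expect RL, trained long enough, to eventually recover whatever the fast channel contributes.

The paper's explanation: some of the task signal carried by the text channel (complex task-specific instructions, for example) is hard to express in the parameters, and forcing the parameters to learn it sacrifices general reasoning ability. FST lets the prompt carry that part while the parameters focus on general capability, so the overall ceiling is higher.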

Advantage 3: less KL drift

Figure 3: KL-to-base at equal reward

Figure 3: KL divergence of FST vs. RL-only at the same reward. FST's KL to the base model is 70% lower, which suggests substantially less catastrophic forgetting.
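For readers who want to track this quantity in their own runs, here is a minimal sketch of a sample-based KL-to-base estimate. The post does not say how the paper measures KL, so the estimator is my assumption: average the log-probability gap over tokens sampled from the trained policy. The numbers passed to `relative_reduction` are hypothetical and only show what "70% lower" means arithmetically.

```python
def kl_to_base_estimate(logp_policy, logp_base):
    """Monte Carlo estimate of KL(pi_theta || pi_base).

    Both lists hold log-probabilities of the SAME tokens, which must have been
    sampled from the trained policy pi_theta.
    """
    if len(logp_policy) != len(logp_base) or not logp_policy:
        raise ValueError("need two non-empty lists of equal length")
    gaps = [lp - lb for lp, lb in zip(logp_policy, logp_base)]
    return sum(gaps) / len(gaps)


def relative_reduction(kl_reference, kl_other):
    """Fraction by which kl_other is lower than kl_reference."""
    return (kl_reference - kl_other) / kl_reference


# Toy log-probs: the policy assigns each sampled token 0.5 nats more than the base model.
kl = kl_to_base_estimate([-1.0, -2.0], [-1.5, -2.5])  # 0.5

# Hypothetical KL values, not results from the paper.
reduction = relative_reduction(kl_reference=0.10, kl_other=0.03)
```

The comparison only makes sense at matched reward, as in Figure 3; comparing KL at matched training step would mix drift with progress.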

This finding is directly actionable: if you worry that an RL-trained model will collapse on unrelated tasks, FST retains more of the base model's capabilities.

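Advantage 4: plasticity preservation

The most striking experiment: train on task A first, then continue from that checkpoint on task B:

Figure 4: Sequential training, plasticity on A→B

Figure 4: The checkpoint trained on task A with RL-only can barely learn task B (reward collapses to near 0). The FST checkpoint keeps adapting to task B, and its final reward approaches that of training from scratch. Plasticity is almost fully preserved.

This is an elegant remedy for two long-standing problems, catastrophic forgetting and plasticity loss. RL over-specializes the parameters to task A; FST moves most of the specialization into the prompt, so the parameters stay general.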

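Advantage 5: continual learning

Figure 5: Continual learning with tasks switching during training

Figure 5: A setting where the task switches during training. FST quickly acquires the new task after every switch; RL-only stalls after the first switch because its parameters can no longer adapt.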

Ablations and Details

The paper runs 4 groups of design ablations (mean@4 at step 500 on CodeIO):

Figure 6: Design ablations on CodeIO

Figure 6: (a) Population size $K \in \{1, 2, 4, 8\}$: even $K=1$ beats RL-only, and $K=8$ with a 6-step cycle is best at 42.84%. (b) Choice of advantage baseline: per-problem beats per-prompt (group statistics pooled across prompts are more stable). (c) Cycle length $T$: too long and the GEPA signal goes stale; too short and RL has no time to benefit. (d) Light vs. full GEPA recipe.

Key findings:

$K=1$ already gives +1.5pp, meaning even a single optimized prompt boosts RL
Population diversity adds a marginal benefit: $K=8$ gains roughly another +1.5pp over $K=1$
The cycle cannot be too long: a 12-step cycle gives up most of the $K=8$ gain (41.13% vs. 42.84%)

This tells us the core gain of fast-slow training comes mainly from having a fast channel at all; diversity is secondary.
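A quick back-of-the-envelope check of the "gives up most of the gain" claim, using only the numbers quoted above. The two +1.5pp figures are approximate, so the implied total gain over RL-only is my own rough derivation, not a number reported in the paper.

```python
# Numbers quoted in this post (CodeIO, step 500, mean@4), in percent.
k8_cycle6 = 42.84        # K=8 with a 6-step cycle (best configuration)
k8_cycle12 = 41.13       # K=8 with a 12-step cycle
gain_k1_over_rl = 1.5    # approximate
gain_k8_over_k1 = 1.5    # approximate

# Implied total gain of the best configuration over RL-only (rough, derived).
total_gain = gain_k1_over_rl + gain_k8_over_k1

# Points lost by doubling the cycle length.
drop_from_long_cycle = round(k8_cycle6 - k8_cycle12, 2)

# Share of the total gain that the longer cycle gives up.
fraction_lost = drop_from_long_cycle / total_gain
```

Under these assumptions the 12-step cycle costs 1.71pp, a bit more than half of the roughly 3pp total gain and about as much as the entire diversity gain from going from $K=1$ to $K=8$.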

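Compute cost

On a single node with 8× H100, per-RL-step wall-clock time is ~60s for RL-only, ~100s for FST without rollout reuse, and ~47s for FST with rollout reuse
But FST also has to run the GEPA cycle's extra rollouts and reflection calls
A headline FST run costs 25-40 H100 GPU-hours in total
GEPA uses GPT-5.2 for reflection, at under 10 USD per run

Overall: FST is somewhat more expensive than RL-only in compute (the GEPA cycles are extra cost), but with up to 3× better data efficiency, training end to end to the same reward ends up cheaper.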

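My Verdict: Is It Worth Reading?

Strongly recommended, especially if you work on RL post-training, catastrophic forgetting, or continual learning.

Highlights:

An elegant paradigm: prompt and parameters are treated as two channels of adaptation. Once stated, the framing is obvious in hindsight, yet the field has handled the two separately until now
Directly actionable: FST trains faster, more stably, and more cheaply than RL-only. Improving three dimensions at once is rare
A clean plasticity-preservation experiment: in A→B sequential training, RL-only collapses almost completely. The demo exposes the plasticity-loss problem thoroughly
5 domains × 4 base models: ample evidence of generalization
Thorough engineering: rollout reuse, Pareto population, and cycle-length tuning are all ablated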

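Issues / limitations:

Dependence on a strong reflection LM: GEPA uses GPT-5.2 for reflection. If the reflection LM is not strong enough, the signal quality of the fast channel drops sharply
Extra reflection-LM cost: under 10 USD per run is cheap, but it could add up at production training scale
The prompt is still needed at inference: at deployment the optimized prompt must be injected into the context, which adds inference latency. The paper does not discuss this deployment cost
No comparison with LoRA/adapters: LoRA is also an "extra channel of adaptation", so FST should be compared directly against LoRA-based continual learning; the paper does not do this
The continual-learning experiments are fairly small: with a limited number of tasks, long-term cumulative interference cannot be observed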

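Takeaways for engineering practice:

If you are fine-tuning an LLM for a specific task: try running RL and GEPA together; you may get up to 3× data efficiency
If you worry about the model collapsing on unrelated tasks after fine-tuning: FST substantially reduces KL drift
If you need sequential or continual fine-tuning: RL-only is close to infeasible (plasticity is almost entirely lost), and FST is the most promising option I have seen so far
The broader takeaway: post-training should not be "parameter learning followed by prompt tuning"; it should be joint optimization over multiple adaptation channels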

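Closing

This paper gives the whole RL post-training community a fundamental reminder: adaptation does not have to go through the parameter channel alone.

Looking back at how fine-tuning has evolved:

Full fine-tuning: update all parameters; slow but powerful
LoRA: low-rank adapters; parameter-efficient, but still a single channel
Prompt tuning: adjust only the prompt; cheap but limited
FST: optimize parameters and prompt together, combining the strengths of both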

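I think the FST framing will become the default for RL post-training in the long run. My reasons:

Its empirical benefits show up along several dimensions (efficiency, stability, plasticity)
Its core insight (multi-channel adaptation) is a fundamental, architecture-level improvement, not a one-off trick
It composes naturally with work from the prompt-optimization community (GEPA, DSPy, APE, and others)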

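Open questions worth following:

Can the fast channel be subdivided further? For example, three fast channels at once: instructions, few-shot examples, and retrieval context
Can the reflection LM be the model itself? The paper uses an external GPT-5.2; what happens if the model reflects on its own rollouts?
Can the prompt be picked dynamically at inference, retrieving a different prompt depending on the query type?
How does this relate to in-context learning? The fast channel is essentially learned in-context conditioning; can the two be unified?

If you work on LLM training, I strongly recommend putting this paper on your must-read list. It is one of the few works that could change the default training recipe.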
