The Path Not Taken, RL vs SFT
Updated 18 May 2026
https://arxiv.org/abs/2511.08567
Good summary: https://blog.yuandong-tian.com/blog/posts/2025_summary_1_en
SFT causes overfitting and catastrophic forgetting. The apparent reason is that the training data isn’t “on-policy” enough. A deeper reason, however, is that external data directly and heavily modifies the principal components of the weights, which destabilizes the “foundation” of the model and leads to a significant performance drop.
RL trains on on-policy data, so it leaves the principal components of the weights unchanged and alters only the minor ones. This avoids catastrophic forgetting, and the distribution of weight deltas tends to be sparser (especially under bf16 quantization).
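To make the two claims above concrete for myself, here is a rough sketch of how one could measure them on a single weight matrix: how much of an update's energy falls in the top-k singular subspaces of the original weights, and what fraction of weights keep the same bf16 value after the update. Everything here is my own assumption rather than the paper's code: the function names, the bit-truncation stand-in for bf16 (real bf16 conversion rounds rather than truncates), and the two hand-built toy updates, which exist only to show what the measurements look like at the two extremes.

```python
import numpy as np

def to_bf16(x):
    """Simulate bf16 by keeping only the top 16 bits of each float32 (truncation, no rounding)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def principal_energy_fraction(W, delta, k):
    """Fraction of ||delta||_F^2 that lies in the top-k left/right singular subspaces of W."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    Uk, Vk = U[:, :k], Vt[:k, :].T
    projected = Uk @ (Uk.T @ delta @ Vk) @ Vk.T
    return float(np.sum(projected ** 2) / np.sum(delta ** 2))

def update_sparsity(W, delta):
    """Fraction of weights whose simulated bf16 value is unchanged by the update."""
    return float(np.mean(to_bf16(W) == to_bf16(W + delta)))

# Toy weight matrix with a known, decaying spectrum (hypothetical, not a real model).
rng = np.random.default_rng(0)
n, k = 64, 8
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
W = U @ np.diag(10.0 * 0.8 ** np.arange(n)) @ V.T

# Hand-built extremes: a large update along the top-k directions,
# and a small update confined to the remaining (minor) directions.
on_principal = 0.5 * U[:, :k] @ V[:, :k].T
off_principal = 1e-4 * U[:, k:] @ V[:, k:].T

frac_on = principal_energy_fraction(W, on_principal, k)
frac_off = principal_energy_fraction(W, off_principal, k)
sparse_on = update_sparsity(W, on_principal)
sparse_off = update_sparsity(W, off_principal)
print(f"principal energy: on={frac_on:.3f} off={frac_off:.3f}")
print(f"bf16-unchanged fraction: on={sparse_on:.3f} off={sparse_off:.3f}")
```

Note that in this toy the sparsity comes from the off-principal update being small relative to the weights, so most entries do not move far enough to cross a bf16 step. On a real model one would replace W and the deltas with actual checkpoints before and after SFT or RL.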
I liked the image in the paper of going through a mountain versus finding a path around.
I think the most challenging part of interpretability is achieving a first-principles explanation: starting from the intrinsic structure of the data itself, how and why do models converge to these decoupled, sparse, low-rank, modular, and composable emergent features and circuits? How are these emergent structures related to the model architecture, the optimization algorithm, and the training hyperparameters?
Low-rank structure is what makes LoRA work. It also makes me think that models can be compressed a lot.
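A minimal sketch of the low-rank intuition, assuming a synthetic weight update that is rank 4 plus a little noise (the shapes, the rank, and the helper name low_rank_factors are all made up for illustration). It factors the update into two thin matrices with a truncated SVD and compares parameter counts. LoRA itself trains the two factors directly instead of computing an SVD after the fact; the SVD here only shows how little is lost when the update really is close to low rank.

```python
import numpy as np

def low_rank_factors(delta, r):
    """Best rank-r approximation of delta as B @ A, via truncated SVD."""
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    B = U[:, :r] * s[:r]   # shape (d_out, r), singular values folded into B
    A = Vt[:r, :]          # shape (r, d_in)
    return B, A

# Synthetic update: exactly rank 4, plus small dense noise (hypothetical data).
rng = np.random.default_rng(0)
d_out, d_in, true_rank = 256, 128, 4
delta = rng.standard_normal((d_out, true_rank)) @ rng.standard_normal((true_rank, d_in))
delta += 1e-3 * rng.standard_normal((d_out, d_in))

B, A = low_rank_factors(delta, r=4)
rel_err = np.linalg.norm(delta - B @ A) / np.linalg.norm(delta)
compression = delta.size / (B.size + A.size)
print(f"relative error: {rel_err:.2e}, parameters saved: {compression:.1f}x")
```

The compression ratio is just d_out * d_in / (r * (d_out + d_in)), so it grows with the matrix size for a fixed rank. Whether real weights, as opposed to updates, are this compressible is a separate question that this toy does not answer.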