DeepSeek-V4 and Manifold Tearing (repost)
Original post: https://substack.com/home/post/p-195402224; will be removed at the author’s request.
Single Token Geometry: DeepSeek V4 and Manifold Tearing
Apr 25, 2026
I was preparing Single Token Geometry: Data Complexity, the second entry in this series, when DeepSeek V4 dropped. I set that draft aside immediately.
What caught my attention wasn’t the benchmark numbers, impressive as they are. It was a paragraph buried in Section 4.2.3 of the technical report:
“We identified that the occurrence of spikes is consistently tied to outliers in the MoE layers, and the routing mechanism itself appears to exacerbate the emergence of these outliers.”
And then, with unusual candor:
“Although a comprehensive theoretical understanding of their underlying mechanisms remains an open question for now, we are sharing them openly to foster further exploration by the community.”
I appreciate DeepSeek’s transparency here. They found three empirical fixes that work. They admitted they don’t fully understand why. And they published anyway. That intellectual honesty is rare, and it’s exactly the kind of opening that geometric thinking is built for.
This post is my attempt to supply part of what they left open: a geometric account of what training instability is in a system like DeepSeek V4, and why their mitigations work.
The thesis is simple and I want to state it plainly upfront:
Loss spikes in MoE training are not optimization failures. They are manifold tears.
What Is a Manifold Tear?
Before we can call something a tear, we need to be precise about what a manifold is and what it requires.
A manifold is a space that is locally Euclidean. The canonical example is the surface of the Earth: globally curved and non-Euclidean, but zoom in far enough on any point and it looks flat, like a patch of ℝ². More formally, a topological manifold requires that every point has a neighborhood homeomorphic to ℝⁿ. A smooth manifold further requires that the transition maps between overlapping neighborhoods, called charts, are differentiable. A Riemannian manifold additionally equips the space with a metric tensor, giving you notions of distance and curvature.
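For reference, the three levels of structure in symbols (standard textbook definitions, nothing specific to the report):

```latex
\begin{aligned}
&\textbf{Topological:} && \text{every } p \in M \text{ has a chart } \varphi_\alpha : U_\alpha \to \mathbb{R}^n, \text{ a homeomorphism,}\\
&\textbf{Smooth:} && \text{transition maps } \varphi_\beta \circ \varphi_\alpha^{-1} \text{ are } C^\infty \text{ on every overlap } U_\alpha \cap U_\beta,\\
&\textbf{Riemannian:} && \text{a metric tensor } g_p : T_pM \times T_pM \to \mathbb{R} \text{ varies smoothly with } p.
\end{aligned}
```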
The key requirement at the foundation of all of this is continuity. Not just continuity: smoothness. The whole apparatus of differential geometry, and crucially the whole apparatus of gradient-based optimization, assumes that the space being traversed is smooth enough to support derivatives everywhere.
A manifold tear is a violation of this assumption. Precisely: a tear occurs when a map between manifold regions fails to be continuous — when a point has no well-defined image, or when nearby points are sent to regions that are not nearby in the target space. A tear is not a large deformation. A large deformation of a manifold is still a manifold. A tear is a discontinuity: the manifold’s local Euclidean structure breaks down at the torn point.
In the context of a transformer’s residual stream, we can define this operationally:
A manifold tear in a transformer’s residual stream is a discontinuity in the layer-to-layer transport map, induced by a discrete routing decision that is inconsistent with the local geometry of the representation manifold at that point.
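Written out in symbols (my notation, not the report’s; gate weights and shared experts omitted, Top-1 shown for clarity):

```latex
\begin{aligned}
&T_\ell(h) = h + E_{r(h)}(h), \qquad r : \mathbb{R}^d \to \{1, \dots, N\} \ \text{discrete (piecewise constant)}\\[2pt]
&\text{At a boundary point } h \text{ between routing regions } i \text{ and } j, \text{ the one-sided limits differ by}\\
&\lim_{\substack{h' \to h \\ r(h') = i}} T_\ell(h') \;-\; \lim_{\substack{h' \to h \\ r(h') = j}} T_\ell(h') \;=\; E_i(h) - E_j(h).
\end{aligned}
```

That difference is generically nonzero, because nothing in training forces two experts to agree outside their own regions. The transport map jumps at the boundary, and the jump is the tear.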
Now let’s see exactly how that happens in DeepSeek V4 and how the three mitigations each address a different stage of the same failure cascade.
The Geometry of the Residual Stream
Think of a single token passing through a transformer. At each layer, its hidden state is a point in ℝᵈ, a high-dimensional vector. As the token passes through successive layers, this point traces a trajectory through representation space. The claim that this trajectory lives on a manifold is implicit in how we train: gradient descent assumes a smooth loss landscape, backpropagation assumes differentiable operations everywhere, and the expressive power of deep networks comes precisely from learning smooth, structured transformations of this space.
In a standard transformer, every token passes through the same FFN at each layer. The map from layer l to layer l+1 is the same for all tokens. The manifold deforms, but it deforms continuously — the same function applied everywhere.
In a Mixture-of-Experts (MoE) transformer like DeepSeek V4, this changes fundamentally. At each MoE layer, a router examines the token’s hidden state and routes it to one or more experts, a discrete, Top-k selection from hundreds of possible FFNs. The composite map from hidden state to new hidden state is now:
hidden state → routing decision → expert application → new hidden state
The routing decision is discontinuous by construction. It is a discrete selector. A token sitting at position x in representation space gets routed to expert E₁. A token at position x + ε, infinitesimally close, might get routed to expert E₂. These two experts were trained on different regions of the manifold. Their outputs are not guaranteed to be close to each other.
This means the layer-to-layer transport map has discontinuities at routing boundaries. The manifold is already torn, structurally, at every expert boundary. This is not a bug — it is the price of the MoE architecture’s efficiency. But it is a latent geometric fragility that becomes catastrophic under the right conditions.
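To make the boundary discontinuity concrete, here is a toy numerical demonstration. Everything in it is illustrative (random linear "experts", a Top-1 argmax router); it is not DeepSeek V4’s routing code, but it exhibits the same structural property: points arbitrarily close in input space can be mapped far apart.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 16, 4

# Toy router: linear scores, Top-1 argmax (stand-in for Top-k selection).
W_router = rng.normal(size=(n_experts, d))
# Toy experts: independent linear maps, stand-ins for FFNs that were
# trained on different regions of the manifold.
experts = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]

def route(h):
    return int(np.argmax(W_router @ h))

def moe_layer(h):
    return h + experts[route(h)] @ h  # residual stream + selected expert

# Find two points assigned to different experts, then bisect down to the
# routing boundary so the pair becomes numerically almost identical.
a, b = rng.normal(size=d), rng.normal(size=d)
while route(a) == route(b):
    b = rng.normal(size=d)
for _ in range(50):
    mid = (a + b) / 2
    a, b = (mid, b) if route(mid) == route(a) else (a, mid)

print("distance in: ", np.linalg.norm(a - b))
print("distance out:", np.linalg.norm(moe_layer(a) - moe_layer(b)))
print("routed to experts", route(a), "and", route(b))
```

After fifty bisections the input distance shrinks to machine precision, while the output distance stays at the scale of the hidden state itself: the layer map is discontinuous exactly at the routing boundary.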
The Failure Cascade
The paper’s observation that spikes are tied to outliers in MoE layers, and that routing exacerbates them, describes a specific failure cascade. Geometrically, it unfolds in three stages:
Stage 1: Local curvature spike. An activation in an MoE expert grows very large. The SwiGLU gate or linear component produces an extreme value. This is not yet a tear — it is a region of very high curvature on the representation manifold. The local geometry is still technically defined, but it is poorly conditioned. Small changes in input produce large changes in output. The optimizer’s gradient step, calibrated for normal curvature, begins to overshoot.
Stage 2: Chart inconsistency. The router, updating synchronously with the backbone, now operates on a shifted manifold. At step t, the backbone parameters θₜ define a representation space Mₜ. But the router, which just updated, is making routing decisions as if the manifold were Mₜ₋₁. A token at position x on Mₜ gets routed as if it were on a manifold that no longer exists. This is a chart inconsistency: the coordinate chart (the routing assignment) no longer matches the local geometry of the point being mapped. The token is being sent to the wrong expert for where it actually lives in the representation space.
Stage 3: Tear amplification. The misrouted token enters an expert that wasn’t trained on its region of the manifold. The expert produces an outlier output — a point far from where the token should be in the next layer’s representation space. This discontinuous jump is the tear. And if the residual mapping matrices are expansive (spectral norm > 1), the tear grows as it propagates through subsequent layers. By layer L, a token that was slightly misrouted has been thrown far from its correct manifold region. The loss spike is the accumulated geometric damage becoming visible in the training objective.
This is the cascade: high curvature → chart inconsistency → routing boundary discontinuity → tear amplification → loss spike.
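The Stage 3 claim about expansive maps is easy to check numerically. A minimal sketch, using random orthogonal matrices scaled to a chosen spectral norm as stand-ins for the layer-to-layer Jacobians (synthetic matrices, not trained weights; the layer count and gains are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_layers = 64, 24

def layer_jacobian(gain):
    # Random orthogonal matrix scaled so its spectral norm is exactly
    # `gain`: a cartoon of one layer's transport of a perturbation.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return gain * q

delta = rng.normal(size=d)
delta /= np.linalg.norm(delta)  # unit-norm misrouting error injected early

for gain in (0.95, 1.05):
    err = delta.copy()
    for _ in range(n_layers):
        err = layer_jacobian(gain) @ err
    print(f"spectral norm {gain}: error norm after {n_layers} layers "
          f"= {np.linalg.norm(err):.2f}")
# spectral norm 0.95 -> ~0.29 (the tear decays)
# spectral norm 1.05 -> ~3.23 (the tear compounds layer after layer)
```

The point is only that transport with spectral norm above 1 compounds the tear geometrically with depth, while contractive transport heals it.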
DeepSeek V4’s three mitigations each interrupt this cascade at a different stage.
Mitigation 1: SwiGLU Clamping, Bounding Local Curvature
This is the earliest intervention, attacking Stage 1 before the cascade begins.
The SwiGLU activation computes:
SwiGLU(x, g) = x · σ(g) · g
When either the linear component x or the gate g becomes very large, the local curvature of the loss surface spikes. Mathematically, curvature is related to the second derivative of the map — and when activations are extreme, second derivatives become extreme. The optimizer’s gradient step assumes the loss landscape is locally well-approximated by its first-order Taylor expansion. High curvature violates this assumption. The step overshoots. The next gradient is wildly miscalibrated. The cascade is initiated.
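The overshoot argument is the standard second-order Taylor bound, stated here for reference (textbook optimization, not something from the report):

```latex
L(\theta - \eta \nabla L)
  = L(\theta) - \eta \lVert \nabla L \rVert^{2}
  + \tfrac{\eta^{2}}{2}\, \nabla L^{\top} H \, \nabla L + O(\eta^{3})
```

The step decreases the loss only while the quadratic term stays small relative to the linear one, roughly η ≲ 2/λ_max(H). An activation outlier inflates λ_max(H) past whatever step size the schedule was tuned for, and the step overshoots.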
DeepSeek V4 clamps the linear component to [−10, 10] and caps the gate at 10. Geometrically, this is a curvature bound: it enforces that no single point in the activation space can develop extreme local curvature. The manifold remains smooth enough at every point that gradient steps stay valid.
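A minimal sketch of the clamp. The [−10, 10] range on the linear component and the cap of 10 on the gate are the values reported; the weight layout, shapes, and surrounding FFN structure are my assumptions, not DeepSeek’s implementation:

```python
import torch

def clamped_swiglu_ffn(x, W_in, W_gate, W_out,
                       linear_bound=10.0, gate_cap=10.0):
    """SwiGLU feed-forward block with V4-style activation clamps.

    Illustrative sketch: the bounds match the report's numbers, but the
    weight layout here is generic.
    """
    linear = torch.clamp(x @ W_in, min=-linear_bound, max=linear_bound)
    gate = torch.clamp(x @ W_gate, max=gate_cap)   # cap the gate from above
    hidden = linear * torch.sigmoid(gate) * gate   # SwiGLU(x, g) = x·σ(g)·g
    return hidden @ W_out

# Example shapes (illustrative): d_model=8, d_ff=16
x = torch.randn(2, 8)
W_in, W_gate, W_out = (torch.randn(8, 16), torch.randn(8, 16),
                       torch.randn(16, 8))
y = clamped_swiglu_ffn(x, W_in, W_gate, W_out)
```

Note that clamp zeroes the gradient outside the bounds, so no update is ever computed from inside the high-curvature region.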
Think of it as a chart boundary condition. A chart in differential geometry is only valid within a bounded region — you cannot use a single flat map to cover the entire Earth. SwiGLU clamping is enforcing that activations stay within the region where the local chart (the linear approximation used by the optimizer) remains valid. Step outside that region, and the chart breaks down. The clamp keeps you inside.
The paper notes this works without compromising performance — which makes geometric sense. Clamping doesn’t restrict the manifold’s expressivity; it restricts the curvature of any single point. The manifold can still be highly complex and nonlinear globally. It just can’t have singular points locally.
Mitigation 2: Anticipatory Routing, Preventing Chart Inconsistency
Where SwiGLU clamping addresses the precondition, Anticipatory Routing attacks Stage 2 directly: the moment of chart inconsistency.
Recall the problem: at training step t, the backbone parameters θₜ define the current representation manifold Mₜ. If the router also updates to θₜ simultaneously, it is now making routing decisions based on a manifold that is being constructed in real time. The routing chart and the geometric chart are out of sync.
Anticipatory Routing enforces a temporal consistency condition: at step t, routing decisions are made using the historical parameters θₜ₋Δₜ. The router operates on the snapshot of the manifold that produced the current token representations. In differential geometry terms, this is analogous to a connection — a rule for parallel-transporting objects along the manifold in a consistent way. You don’t route a vector using the geometry at the destination; you route it using the geometry at the source.
The implementation is elegant: the data for step t is fetched in advance at step t − Δt, and routing indices are precomputed and cached. The router never sees the manifold mid-update. The discrete chart (routing assignment) and the continuous chart (representation geometry) remain synchronized.
The dynamic activation is particularly revealing. The system detects loss spikes and then activates Anticipatory Routing for a stabilization period before reverting to standard training. This is the system monitoring for chart inconsistency and applying the consistency condition on demand — a feedback loop that treats geometric misalignment as a detectable, correctable event rather than an inevitable one.
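To fix ideas, here is a schematic of the mechanism as I read it. The Δt prefetch, the precomputed-and-cached routing indices, and the spike-triggered activation come from the report; the class shape, the `router.top_k` / `loader.get` interfaces, and the spike heuristic are hypothetical:

```python
from collections import deque

class AnticipatoryRouting:
    """Schematic of Anticipatory Routing: the batch for step t is fetched
    at step t - dt and its routing indices are precomputed with the router
    parameters of step t - dt, so the routing chart is never newer than
    the representations it routes. Names and interfaces are illustrative."""

    def __init__(self, router, loader, dt=4, spike_factor=2.0,
                 stabilize_steps=1000):
        self.router, self.loader, self.dt = router, loader, dt
        self.cache = {}                   # step -> (batch, cached indices)
        self.spike_factor = spike_factor
        self.stabilize_steps = stabilize_steps
        self.stabilize_until = -1
        self.recent = deque(maxlen=50)    # running window of recent losses

    def prefetch(self, t):
        # At step t, fetch the batch for step t + dt and compute its
        # routing indices with the *current* parameters. By the time the
        # batch is consumed, those indices are exactly dt steps stale.
        batch = self.loader.get(t + self.dt)
        self.cache[t + self.dt] = (batch, self.router.top_k(batch))

    def get_batch(self, t, last_loss):
        # Dynamic activation: a loss well above its recent mean opens a
        # stabilization window during which the cached (stale) indices
        # are used; otherwise routing is the standard synchronous kind.
        if self.recent and last_loss > self.spike_factor * (
                sum(self.recent) / len(self.recent)):
            self.stabilize_until = t + self.stabilize_steps
        self.recent.append(last_loss)

        batch, cached = self.cache.pop(t)
        if t <= self.stabilize_until:
            return batch, cached                   # anticipatory: old chart
        return batch, self.router.top_k(batch)     # standard: fresh routing
```

Each training step would call prefetch(t) and get_batch(t, last_loss) together, so the cache stays Δt steps ahead (warm-up for the first Δt steps is omitted here).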