DeepSeek-V4 and Manifold Tearing (repost)

Original post: https://substack.com/home/post/p-195402224 (reposted; will be removed on request in case of infringement).

Single Token Geometry: DeepSeek V4 and Manifold Tearing

Deep Manifold

Apr 25, 2026

I was preparing Single Token Geometry: Data Complexity, the second entry in this series, when DeepSeek V4 dropped. I set it aside immediately.

What caught my attention wasn’t the benchmark numbers, impressive as they are. It was a paragraph buried in Section 4.2.3 of the technical report:

“We identified that the occurrence of spikes is consistently tied to outliers in the MoE layers, and the routing mechanism itself appears to exacerbate the emergence of these outliers.”

And then, with unusual candor:

“Although a comprehensive theoretical understanding of their underlying mechanisms remains an open question for now, we are sharing them openly to foster further exploration by the community.”

I appreciate DeepSeek’s transparency here. They found three empirical fixes that work. They admitted they don’t fully understand why. And they published anyway. That intellectual honesty is rare, and it’s exactly the kind of opening that geometric thinking is built for.

This post is my attempt to supply part of what they left open: a geometric account of what training instability is in a system like DeepSeek V4, and why their mitigations work.

The thesis is simple and I want to state it plainly upfront:

Loss spikes in MoE training are not optimization failures. They are manifold tears.


What Is a Manifold Tear?

Before we can call something a tear, we need to be precise about what a manifold is and what it requires.

A manifold is a space that is locally Euclidean. The canonical example is the surface of the Earth: globally curved and non-Euclidean, but zoom in far enough on any point and it looks flat, like a patch of ℝ². More formally, a topological manifold requires that every point has a neighborhood homeomorphic to ℝⁿ. A smooth manifold further requires that the transition maps between overlapping neighborhoods, called charts, are differentiable. A Riemannian manifold additionally equips the space with a metric tensor, giving you notions of distance and curvature.

The key requirement at the foundation of all of this is continuity. Not just continuity: smoothness. The whole apparatus of differential geometry, and crucially the whole apparatus of gradient-based optimization, assumes that the space being traversed is smooth enough to support derivatives everywhere.

A manifold tear is a violation of this assumption. Precisely: a tear occurs when a map between manifold regions fails to be continuous, that is, when a point has no well-defined image, or when nearby points are sent to regions that are not nearby in the target space. A tear is not a large deformation. A large deformation of a manifold is still a manifold. A tear is a discontinuity: the manifold’s local Euclidean structure breaks down at the torn point.

In the context of a transformer’s residual stream, we can define this operationally:

A manifold tear in a transformer’s residual stream is a discontinuity in the layer-to-layer transport map, induced by a discrete routing decision that is inconsistent with the local geometry of the representation manifold at that point.

Now let’s see exactly how that happens in DeepSeek V4 and how the three mitigations each address a different stage of the same failure cascade.


The Geometry of the Residual Stream

Think of a single token passing through a transformer. At each layer, its hidden state is a point in ℝᵈ, a high-dimensional vector. As the token passes through successive layers, this point traces a trajectory through representation space. The claim that this trajectory lives on a manifold is implicit in how we train: gradient descent assumes a smooth loss landscape, backpropagation assumes differentiable operations everywhere, and the expressive power of deep networks comes precisely from learning smooth, structured transformations of this space.

In a standard transformer, every token passes through the same FFN at each layer. The map from layer l to layer l+1 is the same for all tokens. The manifold deforms, but it deforms continuously — the same function applied everywhere.

In a Mixture-of-Experts (MoE) transformer like DeepSeek V4, this changes fundamentally. At each MoE layer, a router examines the token’s hidden state and routes it to one or more experts, a discrete, Top-k selection from hundreds of possible FFNs. The composite map from hidden state to new hidden state is now:

hidden state → routing decision → expert application → new hidden state

The routing decision is discontinuous by construction. It is a discrete selector. A token sitting at position x in representation space gets routed to expert E₁. A token at position x + ε, infinitesimally close, might get routed to expert E₂. These two experts were trained on different regions of the manifold. Their outputs are not guaranteed to be close to each other.

This means the layer-to-layer transport map has discontinuities at routing boundaries. The manifold is already torn, structurally, at every expert boundary. This is not a bug — it is the price of the MoE architecture’s efficiency. But it is a latent geometric fragility that becomes catastrophic under the right conditions.
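A minimal sketch of that routing boundary, under loud assumptions: the two-expert layer, the router weights, and the expert functions below are all hypothetical toys chosen to make the jump visible, not anything taken from DeepSeek V4. It only shows that two inputs a distance of about 2ε apart can land a constant distance apart after a Top-1 routed residual update.

```python
import math

def top1_route(x, router_weights):
    """Return the index of the expert with the largest router logit."""
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
    return max(range(len(logits)), key=logits.__getitem__)

def moe_layer(x, router_weights, experts):
    """Residual update x + expert_k(x), where k is picked by Top-1 routing."""
    k = top1_route(x, router_weights)
    return k, [x_i + e_i for x_i, e_i in zip(x, experts[k](x))]

# Two hypothetical experts that learned very different transformations.
experts = [
    lambda x: [+1.0 * v for v in x],   # expert 0 pushes the point outward
    lambda x: [-1.0 * v for v in x],   # expert 1 pulls it back to the origin
]
# Hypothetical router: the routing boundary is the hyperplane x[0] == x[1].
router_weights = [[1.0, 0.0], [0.0, 1.0]]

eps = 1e-6
x_a = [1.0 + eps, 1.0]   # just on expert 0's side of the boundary
x_b = [1.0 - eps, 1.0]   # just on expert 1's side of the boundary

k_a, y_a = moe_layer(x_a, router_weights, experts)
k_b, y_b = moe_layer(x_b, router_weights, experts)

input_gap = math.dist(x_a, x_b)    # about 2e-6
output_gap = math.dist(y_a, y_b)   # order 1: does not shrink as eps -> 0
print(k_a, k_b, input_gap, output_gap)
```

Shrinking `eps` shrinks `input_gap` but leaves `output_gap` essentially unchanged, which is the operational meaning of "discontinuous" used here.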


The Failure Cascade

The paper’s observation, that spikes are tied to outliers in MoE layers, and that routing exacerbates them, describes a specific failure cascade. Geometrically, it unfolds in three stages:

Stage 1: Local curvature spike. An activation in an MoE expert grows very large. The SwiGLU gate or linear component produces an extreme value. This is not yet a tear — it is a region of very high curvature on the representation manifold. The local geometry is still technically defined, but it is poorly conditioned. Small changes in input produce large changes in output. The optimizer’s gradient step, calibrated for normal curvature, begins to overshoot.

Stage 2: Chart inconsistency. The router, updating synchronously with the backbone, now operates on a shifted manifold. At step t, the backbone parameters θₜ define a representation space Mₜ. But the router, which just updated, is making routing decisions as if the manifold were Mₜ₋₁. A token at position x on Mₜ gets routed as if it were on a manifold that no longer exists. This is a chart inconsistency: the coordinate chart (the routing assignment) no longer matches the local geometry of the point being mapped. The token is being sent to the wrong expert for where it actually lives in the representation space.

Stage 3: Tear amplification. The misrouted token enters an expert that wasn’t trained on its region of the manifold. The expert produces an outlier output — a point far from where the token should be in the next layer’s representation space. This discontinuous jump is the tear. And if the residual mapping matrices are expansive (spectral norm > 1), the tear grows as it propagates through subsequent layers. By layer L, a token that was slightly misrouted has been thrown far from its correct manifold region. The loss spike is the accumulated geometric damage becoming visible in the training objective.

This is the cascade: high curvature → chart inconsistency → routing boundary discontinuity → tear amplification → loss spike.

DeepSeek V4’s three mitigations each interrupt this cascade at a different stage.


Mitigation 1: SwiGLU Clamping, Bounding Local Curvature

This is the earliest intervention, attacking Stage 1 before the cascade begins.

The SwiGLU activation computes:

SwiGLU(x, g) = x · σ(g) · g

When either the linear component x or the gate g becomes very large, the local curvature of the loss surface spikes. Mathematically, curvature is related to the second derivative of the map — and when activations are extreme, second derivatives become extreme. The optimizer’s gradient step assumes the loss landscape is locally well-approximated by its first-order Taylor expansion. High curvature violates this assumption. The step overshoots. The next gradient is wildly miscalibrated. The cascade is initiated.

DeepSeek V4 clamps the linear component to [−10, 10] and caps the gate at 10. Geometrically, this is a curvature bound: it enforces that no single point in the activation space can develop extreme local curvature. The manifold remains smooth enough at every point that gradient steps stay valid.

Think of it as a chart boundary condition. A chart in differential geometry is only valid within a bounded region — you cannot use a single flat map to cover the entire Earth. SwiGLU clamping is enforcing that activations stay within the region where the local chart (the linear approximation used by the optimizer) remains valid. Step outside that region, and the chart breaks down. The clamp keeps you inside.

The paper notes this works without compromising performance — which makes geometric sense. Clamping doesn’t restrict the manifold’s expressivity; it restricts the curvature of any single point. The manifold can still be highly complex and nonlinear globally. It just can’t have singular points locally.
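Here is a scalar sketch of the clamp, using the SwiGLU form written above. The bounds (linear component in [−10, 10], gate capped at 10) come from the description above; where exactly the clamp sits relative to the projections, and the function names, are my assumptions for illustration only.

```python
import math

LINEAR_CLAMP = 10.0   # linear component clamped to [-10, 10]
GATE_CAP = 10.0       # gate capped from above at 10

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def swiglu(x, g):
    """Unclamped SwiGLU on scalars: x * sigmoid(g) * g."""
    return x * sigmoid(g) * g

def clamped_swiglu(x, g):
    """SwiGLU with the linear component clamped and the gate capped.

    Assumption: both limits are applied to the inputs of the activation.
    """
    x = max(-LINEAR_CLAMP, min(LINEAR_CLAMP, x))
    g = min(g, GATE_CAP)
    return swiglu(x, g)

unclamped = swiglu(1e4, 1e4)         # an outlier: on the order of 1e8
clamped = clamped_swiglu(1e4, 1e4)   # bounded by 10 * 10 * sigmoid(10) < 100
typical = clamped_swiglu(2.0, 3.0)   # ordinary activations are left untouched
print(unclamped, clamped, typical)
```

In this toy form the output magnitude can never exceed 100, while any activation already inside the bounds passes through unchanged, which is consistent with the claim that the clamp removes singular points without reshaping the rest of the map.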


Mitigation 2: Anticipatory Routing, Preventing Chart Inconsistency

Where SwiGLU clamping addresses the precondition, Anticipatory Routing attacks Stage 2 directly: the moment of chart inconsistency.

Recall the problem: at training step t, the backbone parameters θₜ define the current representation manifold Mₜ. If the router also updates to θₜ simultaneously, it is now making routing decisions based on a manifold that is being constructed in real time. The routing chart and the geometric chart are out of sync.

Anticipatory Routing enforces a temporal consistency condition: at step t, routing decisions are made using the historical parameters θₜ₋Δₜ. The router operates on the snapshot of the manifold that produced the current token representations. In differential geometry terms, this is analogous to a connection — a rule for parallel-transporting objects along the manifold in a consistent way. You don’t route a vector using the geometry at the destination; you route it using the geometry at the source.

The implementation is elegant: the data for step t is fetched in advance at step t − Δt, and routing indices are precomputed and cached. The router never sees the manifold mid-update. The discrete chart (routing assignment) and the continuous chart (representation geometry) remain synchronized.

The dynamic activation is particularly revealing. The system detects loss spikes and then activates Anticipatory Routing for a stabilization period before reverting to standard training. This is the system monitoring for chart inconsistency and applying the consistency condition on demand — a feedback loop that treats geometric misalignment as a detectable, correctable event rather than an inevitable one.
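The two pieces described above (precomputed routing indices, and spike-triggered activation) can be sketched in a few lines. Everything here is hypothetical: the class names, the Top-1 router, the spike rule (loss above a multiple of a recent running mean), and the window and hold lengths are my own stand-ins, not the report's implementation.

```python
from collections import deque

def route(router_params, token):
    """Top-1 routing: index of the largest router logit (toy router)."""
    logits = [sum(w * v for w, v in zip(row, token)) for row in router_params]
    return max(range(len(logits)), key=logits.__getitem__)

class AnticipatoryRoutingCache:
    """Route the batch of a future step with the parameters available now."""

    def __init__(self, lag):
        self.lag = lag
        self.cache = {}   # step -> routing indices computed `lag` steps earlier

    def precompute(self, current_step, router_params, future_batch):
        # At step t - lag: fetch the batch for step t and route it immediately.
        target_step = current_step + self.lag
        self.cache[target_step] = [route(router_params, tok) for tok in future_batch]

    def indices_for(self, step):
        # At step t: reuse the indices that were computed from theta_{t - lag}.
        return self.cache.pop(step)

class SpikeGate:
    """Switch anticipatory routing on for `hold` steps after a loss spike."""

    def __init__(self, window=20, spike_factor=2.0, hold=50):
        self.recent = deque(maxlen=window)
        self.spike_factor = spike_factor
        self.hold = hold
        self.steps_left = 0

    def update(self, loss):
        baseline = sum(self.recent) / len(self.recent) if self.recent else loss
        if loss > self.spike_factor * baseline:
            self.steps_left = self.hold    # spike: (re)start the stabilization period
        else:
            self.recent.append(loss)       # only non-spike losses feed the baseline
        active = self.steps_left > 0
        self.steps_left = max(0, self.steps_left - 1)
        return active

# Toy demonstration: the router changes between step 0 and step 1.
theta_old = [[1.0, 0.0], [0.0, 1.0]]
theta_new = [[0.0, 1.0], [1.0, 0.0]]
token = [2.0, 1.0]

cache = AnticipatoryRoutingCache(lag=1)
cache.precompute(0, theta_old, [token])   # done at step 0, for step 1
stale = cache.indices_for(1)              # indices from the step-0 snapshot
fresh = [route(theta_new, token)]         # what the updated router would pick

gate = SpikeGate(window=3, spike_factor=2.0, hold=2)
flags = [gate.update(loss) for loss in [1.0, 1.0, 5.0, 1.0, 1.0, 1.0]]
print(stale, fresh, flags)
```

The point of the sketch is the separation of concerns: the cache pins each routing decision to one parameter snapshot, and the gate decides when that pinning is worth its cost.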


Mitigation 3: mHC — Bounding Tear Propagation

Manifold-Constrained Hyper-Connections (mHC) is the deepest of the three interventions, and the one most explicitly geometric in the paper’s own framing: it’s right there in the name.

Standard Hyper-Connections expand the residual stream width by a factor of n_hc, introducing a residual mapping matrix B_l ∈ ℝⁿ×ⁿ (with n = n_hc) at each layer. The update rule is:

X_{l+1} = B_l X_l + C_l F_l(A_l X_l)

The paper found that naive HC exhibited numerical instability when stacked, precisely because B_l is unconstrained. An unconstrained matrix can have spectral norm > 1. The spectral norm is the largest singular value: the maximum factor by which the matrix can stretch a vector. If ‖B_l‖₂ > 1, the residual mapping is expansive: it stretches the representation space at each layer. A small perturbation (say, a token slightly misrouted at layer 5) gets amplified at layer 6, further amplified at layer 7, and so on. By the time it reaches layer L, the tear has been stretched across a large region of representation space. This is tear amplification.
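To make the compounding concrete, here is a small sketch that pushes the same perturbation through 60 identical layers, once with an expansive matrix and once with a doubly stochastic one. The 2×2 matrices, the depth, and the perturbation are hypothetical values picked for illustration; real residual mappings differ per layer.

```python
import math

def matvec(m, v):
    """Multiply matrix `m` (list of rows) by vector `v`."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def propagate(b, delta, num_layers):
    """Apply residual mapping `b` to perturbation `delta`, recording its norm per layer."""
    norms = [math.hypot(*delta)]
    for _ in range(num_layers):
        delta = matvec(b, delta)
        norms.append(math.hypot(*delta))
    return norms

expansive = [[1.1, 0.0], [0.0, 0.9]]           # spectral norm 1.1
doubly_stochastic = [[0.7, 0.3], [0.3, 0.7]]   # rows and columns each sum to 1

delta = [1e-3, 0.0]                            # a small misrouting error
grown = propagate(expansive, delta, 60)
contained = propagate(doubly_stochastic, delta, 60)

print(grown[-1] / grown[0])          # hundreds of times larger
print(contained[-1] / contained[0])  # never above 1
```

A 10% stretch per layer looks harmless locally, but over 60 layers it compounds to a factor in the hundreds; the doubly stochastic matrix never lets the norm exceed its starting value.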

The Birkhoff polytope constraint fixes this by requiring B_l to be a doubly stochastic matrix:

B_l ∈ M := { M ∈ ℝⁿ×ⁿ | M·1ₙ = 1ₙ, 1ₙᵀ·M = 1ₙᵀ, M ≥ 0 }

A doubly stochastic matrix has spectral norm bounded by 1. This makes B_l non-expansive: for any two hidden states x and y,

‖B_l·x − B_l·y‖ ≤ ‖x − y‖

The residual mapping cannot increase the distance between any two points. It is a Lipschitz-1 map on the residual stream. A tear introduced at any layer cannot grow as it propagates through subsequent layers. The manifold can be torn (mHC does not remove the routing boundary discontinuities), but it cannot be stretched apart. The damage is contained.
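The bound of 1 follows from a standard norm inequality. A short derivation, written for a generic doubly stochastic matrix M acting on vectors x and y (the notation ‖·‖₁ and ‖·‖∞ for the induced column-sum and row-sum norms is mine, not the paper's):

```latex
\begin{aligned}
\|M\|_1 &= \max_j \sum_i M_{ij} = 1, \qquad
\|M\|_\infty = \max_i \sum_j M_{ij} = 1
  && \text{(non-negative entries, unit column and row sums)} \\
\|M\|_2 &\le \sqrt{\|M\|_1 \, \|M\|_\infty} = 1 \\
\|Mx - My\|_2 &= \|M(x - y)\|_2 \le \|M\|_2 \, \|x - y\|_2 \le \|x - y\|_2
\end{aligned}
```

Because M·1ₙ = 1ₙ, the bound is attained, so the spectral norm is exactly 1: the all-ones direction is preserved, and every other direction can only shrink or stay the same.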

The Sinkhorn-Knopp algorithm that enforces this constraint is itself geometrically beautiful: it performs alternating projections onto two convex constraint sets, the row-normalized matrices and the column-normalized matrices, until their intersection (the Birkhoff polytope) is reached. You are iteratively projecting B_l onto the manifold of doubly stochastic matrices. The constraint manifold M is itself a well-studied convex polytope, and Sinkhorn-Knopp is a convergent algorithm for finding a point on it; the alternating normalizations are projections in the KL-divergence sense rather than the Euclidean one.
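A minimal sketch of the alternating normalization, assuming the unconstrained parameters are first made strictly positive with an elementwise exponential and that a fixed iteration count is used. Both choices, the 3×3 size, and the example values are my assumptions, not details confirmed by the paper.

```python
import math

def sinkhorn_knopp(matrix, num_iters=50):
    """Alternately normalize the rows and columns of a strictly positive square matrix."""
    m = [row[:] for row in matrix]   # work on a copy
    n = len(m)
    for _ in range(num_iters):
        for row in m:                # row step: make every row sum to 1
            s = sum(row)
            for j in range(n):
                row[j] /= s
        for j in range(n):           # column step: make every column sum to 1
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m

# Hypothetical unconstrained parameters; exp() makes every entry strictly positive.
raw = [[0.2, -1.0, 0.5],
       [1.5, 0.1, -0.3],
       [-0.7, 0.9, 0.4]]
b = sinkhorn_knopp([[math.exp(v) for v in row] for row in raw])

row_sums = [sum(row) for row in b]
col_sums = [sum(b[i][j] for i in range(3)) for j in range(3)]
print(row_sums, col_sums)
```

Each half-step satisfies one constraint exactly and slightly disturbs the other; for a strictly positive input the disturbance shrinks geometrically, so after a modest number of iterations both sets of sums are 1 to within floating-point error.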

Additionally, the input and output mappings A_l and C_l are constrained to be non-negative and bounded via Sigmoid functions. This prevents signal cancellation, another form of geometric pathology where opposing contributions annihilate each other, creating artificial zeros in the representation space that have no geometric meaning.


The Unified Picture

The three mitigations form a layered geometric defense, each acting at a different stage of the failure cascade:

Stage 1.

  • Geometric Event: High local curvature
  • Mitigation: SwiGLU Clamping
  • Mechanism: Bounds activation magnitude; keeps optimizer’s local chart valid

Stage 2.

  • Geometric Event: Chart inconsistency
  • Mitigation: Anticipatory Routing
  • Mechanism: Synchronizes routing chart with geometric chart via temporal consistency

Stage 3.

  • Geometric Event: Tear amplification
  • Mitigation: mHC (Birkhoff)
  • Mechanism: Non-expansive residual mapping; Lipschitz-1 bound across layers

None of the three individually suffices. SwiGLU clamping reduces the probability of a high-curvature precondition but cannot prevent all routing misalignment. Anticipatory Routing prevents chart inconsistency while it is active, but it is only activated reactively, after a loss spike has already been detected. mHC contains the damage if a tear occurs but does not prevent it from forming.

Together, they form a cascade interrupt: clamping attacks the precondition, routing consistency attacks the event, and Lipschitz bounding attacks the aftermath. This is why the paper could say, with some confidence, that training was stabilized — even without a complete theoretical account.


What the Paper Leaves Open

The authors write:

“Although a comprehensive theoretical understanding of their underlying mechanisms remains an open question for now…”

The geometric framing offered here suggests what that theoretical understanding might look like: a formal account of how routing boundary discontinuities propagate through residual streams, how spectral norm bounds on residual mappings contain this propagation, and how temporal consistency conditions in discrete selectors can be enforced as a parallel transport rule.

This is not a complete theory. It is a vocabulary, a set of geometric concepts precise enough to ask the right questions. The manifold is torn by construction in every MoE layer. The question is whether that tear propagates catastrophically or stays bounded. DeepSeek V4’s three mitigations are three answers to that question, each operating at a different scale.

Single Token Geometry is the project of taking seriously the idea that what happens to one token, passing through one forward pass, is already geometrically rich enough to explain phenomena we currently attribute to optimization dynamics. Loss spikes are one example.


Manifold Tearing Is Not DeepSeek’s Problem Alone

It is worth being precise about scope. Manifold tearing is not unique to DeepSeek, or to MoE architectures — though MoE does make it worse, by introducing routing boundaries that are discontinuous by construction. The pathology is more general.

We noticed and discussed this in our 2024 paper, Deep Manifold Part 1: Anatomy of Neural Network Manifold — specifically Section 3.6, Learning Transformation:

“This explains the zig-zag pattern observed in the loss curve during the slow decline stage of almost all foundation model training, suggesting that training is struggling to converge — a hidden bottleneck identified in our analysis. A model is considered to have converged effectively when the standard deviation of its loss values stabilizes at less than 5.”

That zig-zag pattern is not noise. It is not a quirk of the optimizer or the learning rate schedule. High loss deviation is an indication of manifold tearing — the loss surface is not smooth but discontinuous, and the optimizer is crossing those discontinuities rather than descending through them. The standard deviation of the loss is, in this framing, a roughness measure of the representation manifold. When it stabilizes, the manifold has settled into a geometry the optimizer can navigate. When it spikes, the manifold is tearing.
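As a sketch of what "standard deviation of the loss as a roughness measure" could look like in practice, here is a trailing-window computation on two synthetic loss curves. The window length and the curves are hypothetical, and I deliberately do not reuse the threshold from the quote, since it depends on the loss scale of the run.

```python
from statistics import pstdev

def rolling_loss_std(losses, window):
    """Population standard deviation of the loss over each trailing window."""
    return [pstdev(losses[i - window:i]) for i in range(window, len(losses) + 1)]

# Synthetic curves: a smooth slow decline, and the same decline with a zig-zag on top.
smooth = [3.0 - 0.01 * t for t in range(40)]
zigzag = [loss + (0.5 if t % 2 else -0.5) for t, loss in enumerate(smooth)]

smooth_rough = max(rolling_loss_std(smooth, 10))
zigzag_rough = max(rolling_loss_std(zigzag, 10))
print(smooth_rough, zigzag_rough)
```

Both curves have the same trend, so any mean-based summary treats them alike; the windowed standard deviation separates them by more than an order of magnitude.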

This means manifold tearing is visible in almost every foundation model training run ever published. It has been attributed to many things — learning rate warmup, batch size scheduling, data curriculum, optimizer hyperparameters. These attributions are not wrong. But they are proximate causes. They describe the conditions under which tearing is more or less likely. They do not identify the root cause.

The root cause, beyond model architecture, is data.

Specifically: data complexity. The representation manifold that a model learns is not chosen by the architect; it is induced by the training data. A dataset with discontinuous structure (sharp distributional boundaries between domains, conflicting label geometries, or extreme token frequency imbalances) induces a representation manifold with discontinuities baked in from the first forward pass. The model is not tearing a smooth manifold. It is trying to learn a manifold that was never smooth to begin with. The routing boundaries in MoE are a second-order effect; the data boundaries are primary.

This is the argument of the next article in this series.

Single Token Geometry: Data Complexity will ask what it means, geometrically, for data to be complicit in manifold tearing: why certain data compositions make smooth representation manifolds impossible, and what that implies for how we should think about dataset curation, not as an engineering convenience, but as a geometric necessity.

Manifold tearing will continue to appear throughout this series. DeepSeek V4 gave us a precise, honest, and unusually well-documented instance of it. The mitigations they found are real and they work. But they are defenses against a pathology whose origin sits upstream of the architecture — in the data the model is asked to learn from, and in the geometry that data imposes on the representation space before training even begins.

###

This post is licensed under CC BY 4.0 by the author.