
Rotary Positional Embedding


RoPE (Rotary Positional Embedding) is a form of relative positional encoding. Most mainstream large language models currently use RoPE or one of its variants. The original paper is RoFormer: Enhanced Transformer with Rotary Position Embedding.

Since the self-attention computation is inherently position-agnostic, positional encoding has been part of the Transformer since its invention to capture dependencies between different positions. The original Transformer uses absolute (sinusoidal) positional encoding:

$$
\begin{aligned}
p_{k, 2t} &= \sin(k / 10000^{2t / d}) \\
p_{k, 2t+1} &= \cos(k / 10000^{2t / d})
\end{aligned}
$$
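As a concrete reference, here is a minimal PyTorch sketch of this sinusoidal table; the function name and the (seq_len, d) output shape are my own choices for illustration, not part of the original Transformer code:

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d: int) -> torch.Tensor:
    """Absolute sinusoidal positional encoding of shape (seq_len, d)."""
    k = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # positions k, shape (seq_len, 1)
    j = torch.arange(0, d, 2, dtype=torch.float32)               # even dimension indices 2t
    inv_freq = 1.0 / (10000.0 ** (j / d))                        # 1 / 10000^{2t/d}, shape (d/2,)
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(k * inv_freq)   # p_{k, 2t}
    pe[:, 1::2] = torch.cos(k * inv_freq)   # p_{k, 2t+1}
    return pe
```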

Because absolute positional encoding is added directly to the token embeddings, it cannot directly model the relative positions between tokens, and inference quality drops sharply on sequences longer than those seen during training. Relative positional encoding was introduced to address this problem, and RoPE has become the mainstream approach.

Introduction to RoPE

The core idea of RoPE is to find a positional encoding function such that the following equation holds:

$$
\langle f(\mathbf{q}_m, m), f(\mathbf{k}_n, n) \rangle = g(\mathbf{q}_m, \mathbf{k}_n, m - n)
$$

That is, when computing the attention dot product between $\mathbf{q}$ and $\mathbf{k}$, the result does not depend on the absolute positions $m$ and $n$ of the two tokens; it depends only on the relative position $m - n$.

When the embedding dimension $d$ is only 2, the following formulas exactly satisfy this property:

$$
\begin{aligned}
f(\mathbf{q}_m, m) &= \mathbf{q}_m e^{im\theta} \\
f(\mathbf{k}_n, n) &= \mathbf{k}_n e^{in\theta} \\
g(\mathbf{q}_m, \mathbf{k}_n, m - n) &= \operatorname{Re}\!\left[\mathbf{q}_m \mathbf{k}_n^{*} e^{i(m-n)\theta}\right]
\end{aligned}
$$

The proof of this equation uses complex exponentials, relying mainly on the following three properties:

$$
\begin{aligned}
\langle z_1, z_2 \rangle &= \operatorname{Re}[z_1 z_2^{*}] \\
(z_1 z_2)^{*} &= z_1^{*} z_2^{*} \\
(e^{i\phi})^{*} &= e^{-i\phi}
\end{aligned}
$$

Using these three properties, the RoPE equations above can be derived directly.
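Concretely, substituting $f$ into the inner product verifies the property in one line:

$$
\langle f(\mathbf{q}_m, m), f(\mathbf{k}_n, n) \rangle
= \operatorname{Re}\!\left[\mathbf{q}_m e^{im\theta} \left(\mathbf{k}_n e^{in\theta}\right)^{*}\right]
= \operatorname{Re}\!\left[\mathbf{q}_m \mathbf{k}_n^{*} e^{i(m-n)\theta}\right]
= g(\mathbf{q}_m, \mathbf{k}_n, m - n)
$$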

According to Euler's formula:

$$
e^{i\phi} = \cos(\phi) + i \sin(\phi)
$$

Expanding this, we obtain $f$, which takes the same form for both $\mathbf{q}$ and $\mathbf{k}$:

$$
f(\boldsymbol{q}_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} q_m^{(1)} \\ q_m^{(2)} \end{pmatrix}
$$

We can also understand the RoPE formula above from a more intuitive, geometric perspective. The dot product of two 2D vectors equals the product of their lengths times the cosine of the angle between them. The RoPE encoding function simply rotates each vector while keeping its length unchanged, so the dot product of the two rotated vectors depends only on their relative rotation angle, not on their absolute angles.
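A tiny numerical check of this rotation view (purely illustrative; the function name and test values are mine):

```python
import torch

def rotate_2d(v: torch.Tensor, angle: float) -> torch.Tensor:
    """Rotate a 2D vector by the given angle (radians)."""
    c, s = torch.cos(torch.tensor(angle)), torch.sin(torch.tensor(angle))
    rot = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
    return rot @ v

theta = 0.1
q = torch.tensor([1.0, 2.0])
k = torch.tensor([0.5, -1.0])

# Position pairs (7, 3) and (12, 8) share the same relative offset m - n = 4,
# so the dot products of the rotated vectors coincide.
a = rotate_2d(q, 7 * theta) @ rotate_2d(k, 3 * theta)
b = rotate_2d(q, 12 * theta) @ rotate_2d(k, 8 * theta)
print(torch.allclose(a, b))  # True
```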

Once the $d = 2$ case is understood, it is relatively easy to generalize to any even $d$. The embedding dimensions are grouped into pairs, and a different $\theta_i = 10000^{-2(i-1)/d},\ i \in [1, 2, \ldots, d/2]$ is applied to each pair, giving the complete RoPE positional encoding function:

$$
\mathbf{R}_{\boldsymbol{\Theta}, m}^{d} = \begin{pmatrix}
\cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
\sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\
0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
\end{pmatrix}
$$

The core concept of RoPE is to rotate the embedding vectors based on the token's position. This is achieved by applying a rotation matrix to the token's embedding, where the rotation angle is determined by the token's position in the sequence. By rotating the embeddings instead of using fixed position encodings, the model can maintain more flexible and continuous position information.
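Because the rotation matrix above is block-diagonal, implementations never materialize it; each pair of dimensions is rotated directly. Below is a minimal, non-optimized sketch of this idea. The function name and the (seq_len, d) layout are my own assumptions, not taken from any particular library:

```python
import torch

def rope_rotate(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to x of shape (seq_len, d), rotating dimension pairs (0,1), (2,3), ..."""
    seq_len, d = x.shape
    # theta_i = base^{-2(i-1)/d} for i = 1..d/2
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    m = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)         # (seq_len, 1)
    angles = m * theta                                                  # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()

    x1, x2 = x[:, 0::2], x[:, 1::2]           # the two components of each pair
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin        # first element of each rotated pair
    out[:, 1::2] = x1 * sin + x2 * cos        # second element of each rotated pair
    return out

# After applying rope_rotate to both queries and keys, the score
# rope_rotate(q)[m] @ rope_rotate(k)[n] depends only on q, k, and m - n.
```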

Introduction to YaRN

Although RoPE encodes relative positions, it is still limited in generalizing beyond the context window seen during training. Several context-extension methods have been proposed; YaRN is among the most popular, offering a good trade-off between performance and complexity. The original paper is YaRN: Efficient Context Window Extension of Large Language Models.

$$
\theta_i^{\text{new}} = \left[ \gamma_i + (1 - \gamma_i) \frac{L}{L'} \right] \theta_i, \quad
\gamma_i = \begin{cases} 1, & r_i > \tau \\ 0, & r_i < 1 \\ \dfrac{r_i - 1}{\tau - 1}, & \text{otherwise} \end{cases}
$$

The key observation is that a small $\theta_i$ rotates slowly: over the training length $L$, dimension $i$ completes at most $r_i = \frac{\theta_i L}{2\pi}$ full rotations. If $r_i < 1$, that dimension never completes a full cycle during training, so its $\theta_i$ should be interpolated (scaled by $L / L'$, where $L'$ is the extended context length). If $r_i > \tau$, it can safely be extrapolated (left unchanged). A linear ramp is applied between the two conditions.
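A minimal sketch of this frequency correction, assuming the original RoPE $\theta_i$ defined above; the variable names and the default value of $\tau$ are illustrative, not taken from the YaRN code:

```python
import math
import torch

def yarn_adjust_theta(theta: torch.Tensor, L: int, L_new: int, tau: float = 32.0) -> torch.Tensor:
    """Interpolate / extrapolate RoPE frequencies following the YaRN ramp."""
    r = theta * L / (2 * math.pi)                      # rotations completed over training length L
    gamma = ((r - 1) / (tau - 1)).clamp(0.0, 1.0)      # 0 if r < 1, 1 if r > tau, linear in between
    return (gamma + (1 - gamma) * L / L_new) * theta   # gamma = 1: keep theta; gamma = 0: scale by L / L'
```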

YaRN also adds a scaling factor to the attention logits before the softmax:

$$
\lambda = \left( 1 + 0.1 \ln \frac{L'}{L} \right)^2
$$

This is an empirical value without theoretical justification.
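For illustration only, the factor simply multiplies the $\mathbf{q}\mathbf{k}^\top$ logits (variable names are mine):

```python
import math

def yarn_attention_scale(L: int, L_new: int) -> float:
    """YaRN logit scaling factor lambda = (1 + 0.1 * ln(L'/L))^2."""
    return (1.0 + 0.1 * math.log(L_new / L)) ** 2

# scores = (q @ k.transpose(-1, -2)) * yarn_attention_scale(L, L_new) / math.sqrt(head_dim)
```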

Open-Source Implementations

Transformer-Engine

import torch
from typing import Tuple, Union

# Signature excerpt of Transformer-Engine's fused RoPE autograd function; bodies omitted.
class FusedRoPEFunc(torch.autograd.Function):
    @staticmethod
    def forward(
        ctx,
        t: torch.Tensor,                               # input tensor (query or key)
        freqs: torch.Tensor,                           # precomputed rotation angles m * theta_i
        tensor_format: str = "sbhd",                   # layout of t, e.g. "sbhd" or "bshd"
        interleaved: bool = False,                     # rotate interleaved pairs vs. split halves
        cu_seqlens: Union[torch.Tensor, None] = None,  # cumulative sequence lengths for packed input
        cp_size: int = 1,                              # context-parallel world size
        cp_rank: int = 0,                              # context-parallel rank
    ) -> torch.Tensor:
        ...

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor) -> Tuple[Union[torch.Tensor, None], ...]:
        ...

Flash-Attention

import torch
from typing import Optional, Union

# Signature excerpt of Flash-Attention's rotary embedding autograd function; bodies omitted.
class ApplyRotaryEmb(torch.autograd.Function):
    @staticmethod
    def forward(
        ctx,
        x,                                             # input of shape (batch, seqlen, nheads, headdim)
        cos,                                           # precomputed cos(m * theta_i)
        sin,                                           # precomputed sin(m * theta_i)
        interleaved=False,                             # GPT-J style interleaved pairs (True) vs. GPT-NeoX halves (False)
        inplace=False,                                 # rotate x in place to save memory
        seqlen_offsets: Union[int, torch.Tensor] = 0,  # position offset, e.g. when decoding with a KV cache
        cu_seqlens: Optional[torch.Tensor] = None,     # cumulative sequence lengths for varlen/packed input
        max_seqlen: Optional[int] = None,              # maximum sequence length in the packed case
    ):
        ...

    @staticmethod
    def backward(ctx, do):
        ...