Rotary Positional Embedding
RoPE (Rotary Positional Embedding) is a form of relative positional encoding. Most mainstream large language models currently use RoPE or one of its variants. The original paper is RoFormer: Enhanced Transformer with Rotary Position Embedding.
Since the self-attention computation itself is position-agnostic, positional encoding has been part of the Transformer since its invention to capture dependencies between different positions. The original Transformer uses absolute positional encoding.
Because absolute positional encoding is added directly to the token embedding, it cannot directly model the relative positions between tokens, and inference quality drops sharply on sequences longer than those seen during training. Relative positional encoding was introduced to address this problem, and RoPE has become the mainstream approach.
Introduction to RoPE
The core idea of RoPE is to find a positional encoding function $f(\mathbf{x}, m)$ such that the following equation holds:

$$\langle f(\mathbf{q}, m), f(\mathbf{k}, n) \rangle = g(\mathbf{q}, \mathbf{k}, m - n)$$

That is, when the dot product of $f(\mathbf{q}, m)$ and $f(\mathbf{k}, n)$ is computed during attention, the result is independent of the absolute positions $m$ and $n$ of the tokens behind $\mathbf{q}$ and $\mathbf{k}$, and depends only on the relative position $m - n$.

When the embedding dimension is only $d = 2$, the following formulas precisely satisfy the above property (viewing the 2-D vectors $\mathbf{q}$ and $\mathbf{k}$ as complex numbers $q$ and $k$):

$$f(\mathbf{q}, m) = q \, e^{i m \theta}, \qquad f(\mathbf{k}, n) = k \, e^{i n \theta}$$
The proof of the above equation uses complex exponentials and relies mainly on the following three properties:

$$\langle \mathbf{q}, \mathbf{k} \rangle = \operatorname{Re}\!\left[ q \, \overline{k} \right]$$

$$\overline{e^{i n \theta}} = e^{-i n \theta}$$

$$e^{i m \theta} \, e^{-i n \theta} = e^{i (m - n) \theta}$$
Using these three properties, the relative-position property of the RoPE formula above can be easily derived.
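Written out (a reconstruction of the standard argument, again identifying the 2-D vectors with the complex numbers $q$ and $k$):

$$\langle f(\mathbf{q}, m), f(\mathbf{k}, n) \rangle = \operatorname{Re}\!\left[ q e^{i m \theta} \, \overline{k e^{i n \theta}} \right] = \operatorname{Re}\!\left[ q \overline{k} \, e^{i (m - n) \theta} \right],$$

which depends on the positions only through the difference $m - n$.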
According to Euler's formula:

$$e^{i m \theta} = \cos(m \theta) + i \sin(m \theta)$$

Expanding this, we obtain the real-valued rotation form below, which is applied identically to both $\mathbf{q}$ and $\mathbf{k}$:

$$f(\mathbf{x}, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \end{pmatrix}$$
Actually, the above RoPE formula can be understood from another, more intuitive perspective. The geometric meaning of the dot product of two 2-D vectors is the product of their lengths times the cosine of the angle between them. The RoPE positional encoding function simply rotates a vector while keeping its length unchanged, so the dot product of two rotated vectors involves only their relative rotation angle and is independent of the absolute angles.
Once the $d = 2$ case is understood, it becomes relatively easy to understand the case where the embedding dimension $d$ is any even number: the dimensions are divided into $d/2$ pairs, and a different rotation frequency $\theta_i$ is applied to each pair, typically $\theta_i = 10000^{-2i/d}$, resulting in the complete RoPE positional encoding function.
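As a concrete illustration, below is a minimal, unoptimized PyTorch sketch of this pairing scheme. The function names build_rope_cache and apply_rope are ours, and the base 10000 and the non-interleaved half-split pairing are common defaults rather than requirements.

import torch

def build_rope_cache(seq_len: int, dim: int, base: float = 10000.0):
    # theta_i = base^(-2i/d), one frequency per dimension pair
    theta = base ** (-torch.arange(0, dim, 2).float() / dim)      # (dim/2,)
    angles = torch.outer(torch.arange(seq_len).float(), theta)    # (seq_len, dim/2), entry m * theta_i
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # pair element i with element i + dim/2 (non-interleaved layout) and rotate each pair
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)

# The rotated dot product depends only on the relative position (10 - 7 == 53 - 50):
q, k = torch.randn(64), torch.randn(64)
cos, sin = build_rope_cache(seq_len=100, dim=64)
s1 = apply_rope(q, cos[10], sin[10]) @ apply_rope(k, cos[7], sin[7])
s2 = apply_rope(q, cos[53], sin[53]) @ apply_rope(k, cos[50], sin[50])
assert torch.allclose(s1, s2, atol=1e-4)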
The core concept of RoPE is to rotate the embedding vectors based on the token's position. This is achieved by applying a rotation matrix to the token's embedding, where the rotation angle is determined by the token's position in the sequence. By rotating the embeddings instead of using fixed position encodings, the model can maintain more flexible and continuous position information.
Introduction to YaRN
Although RoPE provides relative positional encoding, it is still limited in generalizing past the context window seen during training. Several context-extension methods have been proposed; YaRN is the most popular, offering a good trade-off between performance and complexity. The original paper is YaRN: Efficient Context Window Extension of Large Language Models.
The key point is that when $\theta_i$ is small enough, the corresponding dimension pair rotates slowly; over the training context length $L$, its maximum rotated angle is $L \theta_i$. If $L \theta_i < 2\pi$, that pair never went through a whole cycle during training, so its $\theta_i$ should be interpolated (divided by the context-extension factor). If $L \theta_i \gg 2\pi$, it can safely be extrapolated (left unchanged). For the dimensions in between, a linear ramp between the two treatments is applied, as sketched below.
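A minimal sketch of this per-dimension correction, assuming a training context length orig_len, an extension factor scale, and ramp boundaries alpha/beta taken from the paper's LLaMA defaults; the helper name yarn_thetas is ours:

import math
import torch

def yarn_thetas(dim: int, orig_len: int = 4096, scale: float = 8.0,
                base: float = 10000.0, alpha: float = 1.0, beta: float = 32.0):
    theta = base ** (-torch.arange(0, dim, 2).float() / dim)   # original RoPE frequencies
    # number of full rotations each dimension pair completes over the training length
    rotations = orig_len * theta / (2 * math.pi)
    # ramp: 0 -> pure interpolation (few rotations), 1 -> pure extrapolation (many rotations)
    ramp = ((rotations - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    # blend the interpolated frequency theta/scale with the original theta
    return theta / scale * (1.0 - ramp) + theta * ramp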
YaRN also adds a scale factor to the attention softmax: the attention logits are divided by a temperature $t$, implemented in practice by multiplying the RoPE embeddings by $\sqrt{1/t}$, where:

$$\sqrt{\frac{1}{t}} = 0.1 \ln(s) + 1$$

and $s$ is the context-extension scale factor. This is just an empirical value without theoretical support.
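In open-source implementations this factor is often called mscale; a minimal sketch (the name yarn_attention_scale is ours):

import math

def yarn_attention_scale(scale: float) -> float:
    # sqrt(1/t) = 0.1 * ln(s) + 1; no extra scaling when the context is not extended
    return 0.1 * math.log(scale) + 1.0 if scale > 1.0 else 1.0

Multiplying the RoPE cos/sin tables (and hence both q and k) by this value is equivalent to dividing the attention logits by $t$.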
Open-Source Implementations
Transformer-Engine
import torch
from typing import Tuple, Union

class FusedRoPEFunc(torch.autograd.Function):
    """Fused RoPE autograd function from Transformer-Engine (bodies elided here)."""

    @staticmethod
    def forward(
        ctx,
        t: torch.Tensor,                               # input activations (e.g. queries or keys)
        freqs: torch.Tensor,                           # precomputed rotary angles m * theta_i
        tensor_format: str = "sbhd",                   # layout: sequence, batch, head, head-dim
        interleaved: bool = False,                     # rotate adjacent pairs instead of split halves
        cu_seqlens: Union[torch.Tensor, None] = None,  # cumulative sequence lengths for packed ("thd") input
        cp_size: int = 1,                              # context-parallel world size
        cp_rank: int = 0,                              # context-parallel rank
    ) -> torch.Tensor:
        ...

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor) -> Tuple[Union[torch.Tensor, None], ...]:
        ...
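Since FusedRoPEFunc is a torch.autograd.Function, it is invoked through .apply; in practice Transformer-Engine exposes higher-level wrappers around it, so the sketch below only illustrates the signature above. It assumes a CUDA build of Transformer-Engine, and the freqs layout (angles m * theta_i of shape (seq, 1, 1, dim), duplicated across the two halves) is our assumption about what the fused kernel expects.

# Hedged usage sketch: assumes transformer_engine with CUDA support is installed.
import torch

seq, batch, heads, dim = 128, 2, 16, 64
t = torch.randn(seq, batch, heads, dim, device="cuda", requires_grad=True)  # "sbhd" layout

# Assumed freqs layout: rotary angles m * theta_i, shape (seq, 1, 1, dim).
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, dim, 2, device="cuda").float() / dim))
angles = torch.outer(torch.arange(seq, device="cuda").float(), inv_freq)
freqs = torch.cat((angles, angles), dim=-1).view(seq, 1, 1, dim)

out = FusedRoPEFunc.apply(t, freqs, "sbhd")  # runs the fused CUDA kernel; backward is defined above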
Flash-Attention
import torch
from typing import Optional, Union

class ApplyRotaryEmb(torch.autograd.Function):
    """Rotary embedding autograd function from flash-attn (bodies elided here)."""

    @staticmethod
    def forward(
        ctx,
        x,                                             # (batch, seqlen, nheads, headdim)
        cos,                                           # (seqlen, rotary_dim / 2)
        sin,                                           # (seqlen, rotary_dim / 2)
        interleaved=False,                             # rotate adjacent (GPT-J style) pairs when True
        inplace=False,                                 # write the result back into x
        seqlen_offsets: Union[int, torch.Tensor] = 0,  # per-sequence position offset (e.g. KV-cache decoding)
        cu_seqlens: Optional[torch.Tensor] = None,     # cumulative sequence lengths for varlen (packed) input
        max_seqlen: Optional[int] = None,              # maximum sequence length for varlen input
    ):
        ...

    @staticmethod
    def backward(ctx, do):                             # do: gradient of the loss w.r.t. the output
        ...
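flash-attn also ships a functional wrapper, flash_attn.layers.rotary.apply_rotary_emb, around this autograd function. A hedged usage sketch: it requires flash-attn with its rotary kernels installed, and the cos/sin table construction below is the standard RoPE recipe rather than anything mandated by the API.

import torch
from flash_attn.layers.rotary import apply_rotary_emb  # functional wrapper around ApplyRotaryEmb

batch, seqlen, heads, dim = 2, 128, 16, 64
x = torch.randn(batch, seqlen, heads, dim, device="cuda", dtype=torch.float16)

# cos/sin tables of shape (seqlen, dim // 2) holding cos(m * theta_i) and sin(m * theta_i)
theta = 10000.0 ** (-torch.arange(0, dim, 2, device="cuda").float() / dim)
angles = torch.outer(torch.arange(seqlen, device="cuda").float(), theta)
out = apply_rotary_emb(x, angles.cos(), angles.sin(), interleaved=False)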