Patching with Python's Import System

· 2 min read

In many scenarios, we need to apply patches to third-party libraries in Python. A common approach is "monkey patching." However, monkey patching is not a perfect solution because it dynamically changes attributes after a module has been imported. Sometimes the target module has already been imported and used before the patch is applied, so the monkey patch does not take effect as expected.

We need to find a way to modify modules as early as possible. A better method is to leverage Python's import system to achieve this. For detailed documentation on Python's import system, please refer to the official documentation. In short, Python imports a module in three steps:

  1. Search for the module using a Finder.
  2. Create the module using a Loader.
  3. Bind the module in the current namespace.

In step 1, we can hook into sys.meta_path to create a custom finder, which can return a different module specification (module spec) based on a given module name. In step 2, we can create a new loader for a specific module, which replaces certain attributes (functions, classes, variables) of the module before the created module is returned.

Therefore, with this approach, we can replace an entire module or its attributes when the module is first imported. Since sys.modules acts as a cache, each module is created only once. Consequently, after a module is modified, it will never change again, which is exactly what we expect.
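As a quick illustration of this caching behavior (a minimal sketch using the standard json module):

import json
import sys

assert sys.modules["json"] is json   # the module object created on first import lives in the cache
import json as json_again            # a second import simply returns the cached object
assert json_again is json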

Below is a complete example:

import importlib
import importlib.abc
import importlib.util
import sys

ALIAS_MAP = {"d": "patch.d"}
REPLACE_MAP = {"pkg.a.a_1": {"a_1_func1": ("patch.pkg.a.a_1", "a_1_func1_wrapper")}}


class AttrReplacingLoader(importlib.abc.Loader):
    """Wraps the original loader and replaces attributes after the module body has run."""

    def __init__(self, original_loader, replace_map):
        self.original_loader = original_loader
        self.replace_map = replace_map

    def create_module(self, spec):
        if hasattr(self.original_loader, "create_module"):
            return self.original_loader.create_module(spec)
        return None  # fall back to the default module creation

    def exec_module(self, module):
        self.original_loader.exec_module(module)
        # Wrap the configured attributes right after the module has been executed.
        to_replace = self.replace_map[module.__spec__.name]
        for attr_name, (new_module_name, new_attr_name) in to_replace.items():
            new_module = importlib.import_module(new_module_name)
            if hasattr(module, attr_name):
                setattr(module, attr_name, getattr(new_module, new_attr_name)(getattr(module, attr_name)))


class PatchModuleFinder(importlib.abc.MetaPathFinder):
    """Meta path finder that aliases modules and injects the attribute-replacing loader."""

    def __init__(self, alias_map, replace_map):
        self.alias_map = alias_map
        self.replace_map = replace_map

    def find_spec(self, fullname, path, target=None):
        # Alias: importing `fullname` actually loads the module it is mapped to.
        if fullname in self.alias_map:
            original_name = self.alias_map[fullname]
            original_spec = importlib.util.find_spec(original_name)
            if original_spec:
                return original_spec

        # Replacement: wrap the original loader so attributes are patched on first import.
        if fullname in self.replace_map:
            # Temporarily remove ourselves to avoid recursing while resolving the real spec.
            if self in sys.meta_path:
                sys.meta_path.remove(self)
            original_spec = None
            try:
                original_spec = importlib.util.find_spec(fullname, path)
            finally:
                if self not in sys.meta_path:
                    sys.meta_path.insert(0, self)

            if original_spec and original_spec.loader:
                return importlib.util.spec_from_loader(
                    fullname,
                    AttrReplacingLoader(original_spec.loader, self.replace_map),
                    origin=original_spec.origin,
                    is_package=original_spec.submodule_search_locations is not None,
                )
        return None

sys.meta_path.insert(0, PatchModuleFinder(ALIAS_MAP, REPLACE_MAP))
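To make this concrete, here is a sketch of what the patch package referenced by REPLACE_MAP might look like and how the patch takes effect on first import (the file path patch/pkg/a/a_1.py and the wrapper body are illustrative assumptions):

# patch/pkg/a/a_1.py (hypothetical patch module referenced by REPLACE_MAP)
def a_1_func1_wrapper(original_func):
    # Receives the original attribute and returns its replacement.
    def wrapped(*args, **kwargs):
        print("calling the patched a_1_func1")
        return original_func(*args, **kwargs)
    return wrapped


# Elsewhere, after the finder has been installed and before anything else imports pkg.a.a_1:
import pkg.a.a_1

pkg.a.a_1.a_1_func1()  # runs the wrapped version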

Training Paradigms and Frameworks for Reasoning Models

· 4 min read

Introduction: Reasoning Models and "Test Time Scaling"

As model size and training datasets continue to expand, the scaling laws traditionally followed by large language model training are gradually revealing limitations, yielding diminishing marginal returns. Concurrently, the inherent shortcomings of traditional training methods, such as inadequate understanding when tackling complex problems requiring deep reasoning, are becoming increasingly apparent.

Represented by research such as OpenAI's o1 model, a new type of "reasoning model" has emerged. A key characteristic of these models is their ability to dynamically adjust the computation time and resources needed for reasoning based on the complexity of the problem. This has led to a new scaling law known as "Test Time Scaling". This capability, dedicating varying depths of "thought" according to problem difficulty, is often compared to the "System 2 thinking" proposed by Daniel Kahneman in "Thinking, Fast and Slow," distinguishing it from the fast, intuitive, immediate responses ("System 1 thinking") of traditional large models. By leveraging this deep, step-by-step thinking ability, reasoning models hold the potential to solve more complex problems that were previously challenging for existing models.

Progress in Open Source Reasoning Models

In the open-source community, DeepSeek-R1 stands out as the first representative model employing such reasoning training techniques. Trained by combining rule-based reinforcement learning and the GRPO algorithm, this model achieved significant results and garnered widespread industry attention upon its release.

Since then, integrating reasoning or thought processes into training has become a major trend for mainstream open-source models. For instance, models like Llama 4, Qwen 3, and DeepSeek-Prover-V2 have all incorporated related reasoning-enhanced techniques into their training strategies. Furthermore, with the continuous iteration of similar models (such as DeepSeek-R2), it is foreseeable that reasoning models will become an important paradigm for large models to further elevate their capability ceilings.

Currently, the potential of reasoning models is far from fully realized. Related research remains at the academic forefront (see the Awesome-Inference-Time-Scaling list), with relevant papers continuously emerging, suggesting its potential to evolve into a new, significant model training paradigm.

Core Algorithms

From an algorithmic standpoint, current training for reasoning models primarily centers on Reinforcement Learning (RL) techniques. These methods are largely consistent with the algorithms used for human preference alignment during the post-training phase of traditional large models.

Mainstream algorithms include:

  • PPO
  • GRPO
  • Rule-based Reinforcement Learning

Simultaneously, the academic community is actively exploring and proposing new RL algorithms, such as RLOO and REINFORCE++. In the short term, RL algorithms for training reasoning models will likely continue to evolve and iterate rapidly rather than converge, which requires training frameworks to remain flexible and open.

Reinforcement Learning Training Frameworks

New training paradigms require the support of corresponding training frameworks. Unlike the dominance of frameworks like Megatron-LM in the traditional LLM pre-training domain, the current landscape for large-scale distributed reinforcement learning training frameworks is diverse. Here are several currently popular or noteworthy RL training frameworks:

  • verl (GitHub)
    • Developer: ByteDance
    • Features: Built on Ray, supports integration with mainstream training/inference systems like FSDP, Megatron-LM, vLLM. Designed for easy extension with new RL algorithms and offers good performance.
    • License: Apache License
    • Popularity: GitHub ~7.6k stars
  • OpenRLHF (GitHub)
    • Developer: Open Source Community
    • Features: Built on Ray, integrates DeepSpeed and vLLM, supports multiple RL algorithms.
    • License: Apache-2.0 License
    • Popularity: GitHub ~6.6k stars
  • TRL (GitHub)
    • Developer: Hugging Face
    • Features: Can utilize Accelerate to integrate DeepSpeed for acceleration. Comparatively, less deeply integrated with dedicated inference frameworks, more focused on research and experimental scenarios.
    • License: Apache-2.0 License
    • Popularity: GitHub ~13.6k stars
  • DeepSpeed Chat (GitHub)
    • Developer: Microsoft
    • Features: Implemented based on the DeepSpeed training framework and inference engine.
    • License: Apache-2.0 License
    • Popularity: GitHub ~38.2k stars (Note: star count reflects the entire DeepSpeed project)
  • Nemo-Aligner (GitHub)
    • Developer: NVIDIA
    • Features: Implemented based on Megatron-LM and TensorRT-LLM. Community activity is relatively low at present.
    • License: Apache-2.0 License
    • Popularity: GitHub ~0.7k stars

Framework Trend Analysis: From a technical standpoint, frameworks like verl, which leverage Ray for distributed scheduling while integrating mainstream distributed training libraries (like Megatron-LM) and efficient inference engines (like vLLM), represent a highly promising approach for large-scale model reinforcement learning. This is because they can effectively reuse mature components and adaptation experiences from the existing large model ecosystem.

Unveiling the RoPE Implementations of Llama and DeepSeek

· 9 min read

Almost all open-source models use RoPE (Rotary Position Embedding) based on the same theory from RoFormer: Enhanced Transformer with Rotary Position Embedding. However, there are two ways to implement RoPE: the GPT-J style and the GPT-NeoX style.

GPT-J and GPT-NeoX RoPE Implementations

The GPT-J style is identical to the original RoFormer, using an interleaved method to calculate RoPE. The GPT-NeoX style uses an alternative, non-interleaved method. According to the Eleuther AI blog, they considered the original implementation inefficient and thus improved it by splitting the dimension into two halves (non-interleaved). Note that the GPT-NeoX and GPT-J styles produce different results.

The GPT-NeoX style RoPE calculation is as follows:

import torch

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, max_seq_len):
        seq = torch.arange(max_seq_len, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        freqs = torch.outer(seq, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = emb.cos()[:, None, None, :]
        sin = emb.sin()[:, None, None, :]
        return cos, sin


def _rotate_half(x):
    x1, x2 = torch.chunk(x, 2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(t, cos, sin):
    return t * cos + _rotate_half(t) * sin

The GPT-J style has two ways to implement RoPE, with the complex number method being more intuitive.

Complex number method:

import torch

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, max_seq_len):
        seq = torch.arange(max_seq_len, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        freqs = torch.outer(seq, self.inv_freq)
        freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
        return freqs_cis[:, None, None, :]


def apply_rotary_pos_emb(t, freqs_cis):
    t_ = torch.view_as_complex(t.float().reshape(*t.shape[:-1], -1, 2))
    rotated_t_complex = t_ * freqs_cis
    return torch.view_as_real(rotated_t_complex).flatten(3)

Traditional method:

import torch

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, max_seq_len):
        seq = torch.arange(max_seq_len, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        freqs = torch.outer(seq, self.inv_freq)
        emb = torch.stack((freqs, freqs), dim=-1).flatten(start_dim=1)
        cos = emb.cos()[:, None, None, :]
        sin = emb.sin()[:, None, None, :]
        return cos, sin


def _rotate_every_two(x):
    x1 = x[:, :, :, ::2]
    x2 = x[:, :, :, 1::2]
    x_new = torch.stack((-x2, x1), dim=-1)
    return x_new.flatten(start_dim=3)


def apply_rotary_pos_emb(t, cos, sin):
    return t * cos + _rotate_every_two(t) * sin

Llama and DeepSeek RoPE Implementations

Due to the significant influence of the Hugging Face community, many people believe Llama uses the GPT-NeoX style based on its inference code in the transformers library. However, this is not the case. In Llama's original code, it implements the GPT-J style RoPE using the complex number method. So, why the difference between the two codebases? The answer lies in this issue. In the weight conversion script, they permuted the weights of q_proj and k_proj.

def permute(w, n_heads=n_heads, dim1=dim, dim2=dim):
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

It's not immediately obvious why this works. We will come back to explain everything later.

A similar situation occurred with DeepSeek-V3. In the original code for deepseek-v3, it uses the same complex number method as Llama to compute GPT-J style RoPE (in fact, their code is very similar). Again, in the Hugging Face code for deepseek-v3, it uses a style similar to GPT-NeoX, just like Llama (again, the code is very similar), but with an exception on lines 364 and 367. Yes, it's very similar to the permute function mentioned above.

q = q.view(b, s, h, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
k = k.view(b, s, h, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

So, what is the difference between these two implementations? Many people have the same question, as seen in this discussion.

Since RoPE acts only on the last dimension, a simple example helps us understand why.

In the GPT-NeoX style, with a head dimension of 6, we have:

\begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \\ d_4 \\ d_5 \end{bmatrix} \cdot \begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_0 \\ c_1 \\ c_2 \end{bmatrix} + \begin{bmatrix} -d_3 \\ -d_4 \\ -d_5 \\ d_0 \\ d_1 \\ d_2 \end{bmatrix} \cdot \begin{bmatrix} s_0 \\ s_1 \\ s_2 \\ s_0 \\ s_1 \\ s_2 \\ \end{bmatrix} = \begin{bmatrix} d_0c_0 - d_3s_0 \\ d_1c_1 - d_4s_1 \\ d_2c_2 - d_5s_2 \\ d_3c_0 + d_0s_0 \\ d_4c_1 + d_1s_1 \\ d_5c_2 + d_2s_2 \end{bmatrix}

In the GPT-J style (interleaved), we have:

\begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \\ d_4 \\ d_5 \\ \end{bmatrix} \cdot \begin{bmatrix} c_0 \\ c_0 \\ c_1 \\ c_1 \\ c_2 \\ c_2 \\ \end{bmatrix} + \begin{bmatrix} -d_1 \\ d_0 \\ -d_3 \\ d_2 \\ -d_5 \\ d_4 \\ \end{bmatrix} \cdot \begin{bmatrix} s_0 \\ s_0 \\ s_1 \\ s_1 \\ s_2 \\ s_2 \\ \end{bmatrix} = \begin{bmatrix} d_0c_0 - d_1s_0 \\ d_1c_0 + d_0s_0 \\ d_2c_1 - d_3s_1 \\ d_3c_1 + d_2s_1 \\ d_4c_2 - d_5s_2 \\ d_5c_2 + d_4s_2 \\ \end{bmatrix}

Now, let's look at the DeepSeek style (as implemented in Hugging Face, applying the permutation from the transpose operation before applying the NeoX-style rotation):

First, permute the input vector $d$:

d_{permuted} = \begin{bmatrix} d_0 \\ d_2 \\ d_4 \\ d_1 \\ d_3 \\ d_5 \\ \end{bmatrix}

Then apply the NeoX-style rotation logic to $d_{permuted}$:

\begin{bmatrix} d_0 \\ d_2 \\ d_4 \\ d_1 \\ d_3 \\ d_5 \\ \end{bmatrix} \cdot \begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_0 \\ c_1 \\ c_2 \\ \end{bmatrix} + \begin{bmatrix} -d_1 \\ -d_3 \\ -d_5 \\ d_0 \\ d_2 \\ d_4 \\ \end{bmatrix} \cdot \begin{bmatrix} s_0 \\ s_1 \\ s_2 \\ s_0 \\ s_1 \\ s_2 \\ \end{bmatrix} = \begin{bmatrix} d_0c_0 - d_1s_0 \\ d_2c_1 - d_3s_1 \\ d_4c_2 - d_5s_2 \\ d_1c_0 + d_0s_0 \\ d_3c_1 + d_2s_1 \\ d_5c_2 + d_4s_2 \\ \end{bmatrix}

We find that the DeepSeek style (permute then apply NeoX RoPE) simply produces a permuted version of the GPT-J style result. Specifically, the resulting vector is $[r_0, r_2, r_4, r_1, r_3, r_5]$, where $r$ is the result vector from the GPT-J style calculation.

Recall how attention is calculated, using the dot product of q and k:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

When q and k are permuted in the same way, their dot product remains unchanged. Let $P$ be the permutation matrix; then $(Pq)^T(Pk) = q^T P^T P k$. Since $P$ is a permutation matrix, $P^T P = I$ (the identity matrix), so $q^T P^T P k = q^T I k = q^T k$. The dot product result is the same.

So the DeepSeek style (in Hugging Face) is actually equivalent to the GPT-J style in terms of the final attention scores.
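As a sanity check, here is a small numerical sketch (the helpers rope_gptj, rope_neox, and permute below are my own, not taken from any of the codebases above) verifying that permuting q and k and then applying NeoX-style RoPE gives the same scores as applying GPT-J style RoPE directly:

import torch

torch.manual_seed(0)
d = 8                     # head dimension (even)
m, n = 3, 7               # two token positions
inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))

def rope_gptj(x, pos):
    # Interleaved: pairs are (x0, x1), (x2, x3), ...
    cos = torch.repeat_interleave(torch.cos(pos * inv_freq), 2)
    sin = torch.repeat_interleave(torch.sin(pos * inv_freq), 2)
    rotated = torch.stack((-x[1::2], x[0::2]), dim=-1).flatten()
    return x * cos + rotated * sin

def rope_neox(x, pos):
    # Non-interleaved: pairs are (x0, x_{d/2}), (x1, x_{d/2+1}), ...
    cos = torch.cat((torch.cos(pos * inv_freq),) * 2)
    sin = torch.cat((torch.sin(pos * inv_freq),) * 2)
    rotated = torch.cat((-x[d // 2:], x[: d // 2]))
    return x * cos + rotated * sin

def permute(x):
    # [x0, x1, x2, ...] -> [x0, x2, x4, ..., x1, x3, x5, ...]
    return torch.cat((x[0::2], x[1::2]))

q, k = torch.randn(d), torch.randn(d)
score_gptj = rope_gptj(q, m) @ rope_gptj(k, n)
score_deepseek_hf = rope_neox(permute(q), m) @ rope_neox(permute(k), n)
print(torch.allclose(score_gptj, score_deepseek_hf))  # True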

Now we can return to the Llama code in transformers. Permuting the qw and kw weights beforehand has the same effect as permuting the resulting q and k vectors after the matrix multiplication but before applying RoPE.

In the regular approach (no weight permutation), applying the linear layer:

\mathbf{q}^T = \mathbf{x}^T \times \mathbf{W_q} = \mathbf{x}^T \times \begin{bmatrix} \mathbf{w_0} & \mathbf{w_1} & \mathbf{w_2} & \mathbf{w_3} & \mathbf{w_4} & \mathbf{w_5} \end{bmatrix} = \begin{bmatrix} d_0 & d_1 & d_2 & d_3 & d_4 & d_5 \end{bmatrix}

where $d_i = \mathbf{x} \cdot \mathbf{w_i}$.

In the permuted weight approach:

\mathbf{q}_{permuted\_weights}^T = \mathbf{x}^T \times \mathbf{W_q}_{permuted} = \mathbf{x}^T \times \begin{bmatrix} \mathbf{w_0} & \mathbf{w_2} & \mathbf{w_4} & \mathbf{w_1} & \mathbf{w_3} & \mathbf{w_5} \end{bmatrix} = \begin{bmatrix} d_0 & d_2 & d_4 & d_1 & d_3 & d_5 \end{bmatrix}

Applying the linear layer with permuted weights yields a result vector q that is already permuted in the same way as the DeepSeek style's explicit permutation of q. Then, the Hugging Face Llama code applies the NeoX-style RoPE to this already-permuted vector, which, as shown above, is equivalent to applying the GPT-J style RoPE to the original, unpermuted q.
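A quick numerical check of this claim (a toy sketch using the column convention from the equations above; note that nn.Linear stores the transposed weight, which is why the actual conversion script permutes rows rather than columns):

import torch

torch.manual_seed(0)
d = 6
x = torch.randn(d)
W = torch.randn(d, d)                    # toy projection weight, columns play the role of w_i
perm = torch.tensor([0, 2, 4, 1, 3, 5])  # [even dims, odd dims], matching the permutation above

permute_after = (x @ W)[perm]            # project, then permute q
permute_weights = x @ W[:, perm]         # permute the weight columns, then project
print(torch.allclose(permute_after, permute_weights))  # True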

Now, we understand the complete picture:

Llama used GPT-J style RoPE during training. However, when converting its original weights to the Hugging Face format, it permuted the qw and kw weights and used a GPT-NeoX-like style for inference (for performance reasons).

DeepSeek also used GPT-J style RoPE during training, but its weight conversion did not permute the qw and kw weights. Therefore, the permutation of q and k had to be added inside the transformers inference code (so it gains no performance benefit from using the NeoX-style RoPE calculation itself).

One last question remains: if the GPT-NeoX style RoPE calculation is more performant, why do most open-source models still use the GPT-J style RoPE during training? The answer might be related to long context window extension. I haven't delved deeply into this issue yet.

Adverse Effects on Other AI Frameworks

Now that we understand the RoPE situation for Llama and DeepSeek, unfortunately, the story is far from over. Many AI frameworks copied the Hugging Face code for DeepSeek, leading to unnecessary complexity in their training code.

For example, in the Megatron-LM code:

if self.multi_latent_attention and self.rotary_interleaved:
    raise ValueError("rotary_interleaved does not work with multi_latent_attention.")

It prohibits the use of interleaved RoPE and specifically handles DeepSeek in its RoPE code, mimicking the Hugging Face code:

if multi_latent_attention:
    x1 = t[..., 0::2]
    x2 = t[..., 1::2]
    t = torch.cat((x1, x2), dim=-1)

In PaddlePaddle, it also handles DeepSeek RoPE in the same way:

b, s, h, d = q.shape
q = q.reshape([b, s, h, d // 2, 2]).transpose([0, 1, 2, 4, 3]).reshape([b, s, h, d])

b, s, h, d = k.shape
k = k.reshape([b, s, h, d // 2, 2]).transpose([0, 1, 2, 4, 3]).reshape([b, s, h, d])

Lessons Learned

So, what can we learn from this experience?

  1. Trust different sources according to the following priority: Paper > Original implementation code > Hugging Face code > Third-party frameworks.
  2. Solutions from the open-source community might not be the optimal implementation; think before you act (or copy).
  3. Pay special attention to the differences between inference code and training code when converting between them.

Rotary Positional Embedding

· 5 min read

RoPE (Rotary Positional Embedding) is a type of relative positional encoding. Currently, most mainstream large models use RoPE or its variants. The original paper on RoPE can be found in RoFormer: Enhanced Transformer with Rotary Position Embedding.

Since the self-attention computation is position-independent, positional encoding has been added since the invention of the Transformer to capture dependencies between different positions. The original Transformer uses absolute (sinusoidal) positional encoding:

\begin{aligned} p_{k, 2t} &= \sin(k / 10000^{2t / d}) \\ p_{k, 2t+1} &= \cos(k / 10000^{2t / d}) \end{aligned}

Because absolute positional encoding is added directly to the token embedding, it cannot directly model the relative positions between tokens, and inference quality drops sharply on sequences longer than those seen during training. Relative positional encoding was introduced to correct this problem, and RoPE has become the mainstream approach.

Introduction to RoPE

The core idea of RoPE is to find a positional encoding function such that the following equation holds:

\langle f(\mathbf{q}_m, m), f(\mathbf{k}_n, n) \rangle = g(\mathbf{q}_m, \mathbf{k}_n, m - n)

That is, when calculating the dot product of $q$ and $k$ during attention, the result is independent of the absolute positions $m$ and $n$ of the tokens, and depends only on the relative position $m - n$.

When the embedding dimension $d$ is only 2, the following formulas precisely satisfy the above property:

\begin{aligned} f(\mathbf{q}_m, m) &= (\mathbf{q}_m)e^{im\theta} \\ f(\mathbf{k}_n, n) &= (\mathbf{k}_n)e^{in\theta} \\ g(\mathbf{q}_m, \mathbf{k}_n, m - n) &= \text{Re}[(\mathbf{q}_m)(\mathbf{k}_n)^* e^{i(m-n)\theta}] \end{aligned}

The proof of the above equation involves the application of complex exponential functions, mainly relying on the following three properties:

\begin{aligned} \langle z_1, z_2 \rangle &= \text{Re}[z_1 z_2^*] \\ (z_1z_2)^* &= z_1^* z_2^* \\ (e^{i\phi})^* &= e^{-i\phi} \end{aligned}

Using the above three properties, the RoPE formulas above can be derived easily; the steps are spelled out below.
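Treating $\mathbf{q}_m$ and $\mathbf{k}_n$ as complex numbers, the derivation takes only a few lines (added here for completeness):

\begin{aligned} \langle f(\mathbf{q}_m, m), f(\mathbf{k}_n, n) \rangle &= \text{Re}\left[ (\mathbf{q}_m e^{im\theta}) \left( \mathbf{k}_n e^{in\theta} \right)^* \right] \\ &= \text{Re}\left[ \mathbf{q}_m (\mathbf{k}_n)^* e^{im\theta} e^{-in\theta} \right] \\ &= \text{Re}\left[ \mathbf{q}_m (\mathbf{k}_n)^* e^{i(m-n)\theta} \right] \\ &= g(\mathbf{q}_m, \mathbf{k}_n, m - n) \end{aligned}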

According to Euler's formula:

e^{i\phi} = \cos(\phi) + i \sin(\phi)

Expanding this, we can obtain $f$, which is consistent for both $q$ and $k$:

f(\boldsymbol{q}_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} q_m^{(1)} \\ q_m^{(2)} \end{pmatrix}

Actually, we can understand the above RoPE formula from another, more intuitive perspective. The geometric meaning of the dot product of two-dimensional vectors is the product of their lengths multiplied by the cosine of the angle between them. The RoPE positional encoding function above is equivalent to rotating the vector while keeping its length unchanged. Therefore, the dot product of two rotated vectors depends only on the relative rotation angle and is independent of the absolute angles.

Once the $d = 2$ case is understood, it becomes relatively easy to understand the case where $d$ is any even number. The embedding dimensions are divided into pairs, and a different $\theta_i = 10000^{-2(i-1)/d}, i \in [1, 2, ..., d/2]$ is applied to each pair, resulting in the complete RoPE positional encoding function:

\mathbf{R}_{\boldsymbol{\Theta}, m}^{d} = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\ 0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\ 0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2} \end{pmatrix}

The core concept of RoPE is to rotate the embedding vectors based on the token's position. This is achieved by applying a rotation matrix to the token's embedding, where the rotation angle is determined by the token's position in the sequence. By rotating the embeddings instead of using fixed position encodings, the model can maintain more flexible and continuous position information.
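As a quick numerical check of this relative-position property (a minimal sketch for the $d = 2$ case; the helper name rotate and the value of theta are arbitrary choices of mine):

import math

import torch

torch.manual_seed(0)
theta = 0.3  # an arbitrary rotation frequency

def rotate(x, pos):
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    R = torch.tensor([[c, -s], [s, c]])  # the 2x2 rotation matrix from the formula above
    return R @ x

q, k = torch.randn(2), torch.randn(2)
a = rotate(q, 5) @ rotate(k, 2)  # positions (5, 2), relative offset 3
b = rotate(q, 9) @ rotate(k, 6)  # positions (9, 6), same relative offset 3
print(torch.allclose(a, b))      # True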

Introduction to YaRN

Although RoPE applies relative positional encoding, it is still limited in generalizing beyond the context window seen during training. Several extension methods have been proposed; YaRN is the most popular one, offering a good balance between performance and complexity. The original paper can be found in YaRN: Efficient Context Window Extension of Large Language Models.

\theta_i^{new} = \left[ \gamma_i + (1 - \gamma_i) \frac{L}{L'} \right] \theta_i, \quad \gamma_i = \begin{cases} 1, & r_i > \tau \\ 0, & r_i < 1 \\ \frac{r_i - 1}{\tau - 1}, & \text{others} \end{cases}

The key point is that when $\theta_i$ is small, the rotation is slow; the number of full rotations over the training context length $L$ is $r_i = \frac{\theta_i L}{2\pi}$. If $r_i < 1$, that dimension did not complete a full cycle during training, so its $\theta_i$ should be interpolated. If $r_i > \tau$, it can safely be extrapolated. A linear ramp is applied between the two conditions.

YaRN also adds a scaling factor to the attention softmax:

\lambda = \left( 1 + 0.1 \ln \frac{L'}{L} \right)^2

This is simply an empirical value without theoretical support.
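A minimal sketch of the frequency adjustment described above (the function name yarn_inv_freq and the default values of L, L_new, and tau are illustrative assumptions; dim and base follow the RoPE code in the previous post):

import math

import torch

def yarn_inv_freq(dim, base=10000.0, L=4096, L_new=32768, tau=32):
    # Original RoPE frequencies theta_i.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Number of full rotations each dimension completes over the training length L.
    r = inv_freq * L / (2 * math.pi)
    # gamma = 0 -> interpolate (scale by L / L_new), gamma = 1 -> extrapolate (keep theta_i).
    gamma = torch.clamp((r - 1) / (tau - 1), 0.0, 1.0)
    return (gamma + (1 - gamma) * L / L_new) * inv_freq

def yarn_softmax_scale(L=4096, L_new=32768):
    # The empirical attention-logit scale from the formula above.
    return (1 + 0.1 * math.log(L_new / L)) ** 2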

Open Source Implementation

Transformer-Engine

class FusedRoPEFunc(torch.autograd.Function):
    def forward(
        ctx,
        t: torch.Tensor,
        freqs: torch.Tensor,
        tensor_format: str = "sbhd",
        interleaved: bool = False,
        cu_seqlens: Union[torch.Tensor, None] = None,
        cp_size: int = 1,
        cp_rank: int = 0,
    ) -> torch.Tensor:
        ...

    def backward(ctx, grad_output: torch.Tensor) -> Tuple[Union[torch.Tensor, None], ...]:
        ...

Flash-Attention

class ApplyRotaryEmb(torch.autograd.Function):
    def forward(
        ctx,
        x,
        cos,
        sin,
        interleaved=False,
        inplace=False,
        seqlen_offsets: Union[int, torch.Tensor] = 0,
        cu_seqlens: Optional[torch.Tensor] = None,
        max_seqlen: Optional[int] = None,
    ):
        ...

    def backward(ctx, do):
        ...

Troubleshooting Send to Kindle Conversion Failures

· One min read

Occasionally, users encounter difficulties when attempting to send EPUB files to their Kindle devices using Amazon's "Send to Kindle" service. The underlying causes for these conversion failures remain elusive, presenting a challenge for consistent troubleshooting.

However, a practical workaround often proves effective:

Intermediate AZW3 Conversion:

  • Convert the problematic EPUB file to the AZW3 format, which is a Kindle-native format.
  • Subsequently, convert the AZW3 file back to EPUB.

This two-step conversion process can effectively address certain compatibility issues that may be hindering the initial "Send to Kindle" conversion. By introducing this intermediate format, the file undergoes a re-processing that often resolves underlying formatting or encoding conflicts.

While this workaround is frequently successful, it's important to note that the specific reasons for the original conversion failures can vary. Amazon's conversion algorithms and the inherent complexities of EPUB formatting contribute to this variability.

Floating-Point Representation: A Deep Dive

· 5 min read

IEEE 754 Floating-Point Standard

The IEEE 754 standard defines how floating-point numbers are represented in computers. A number, $V$, is expressed as:

V = (-1)^s \times M \times 2^E

Where:

  • $s$ (Sign): Determines the sign of the number: $s = 0$ for positive, $s = 1$ for negative.
  • $M$ (Significand/Mantissa): A fractional binary number. It ranges from $1$ to $2 - \epsilon$ for normalized values, or from $0$ to $1 - \epsilon$ for denormalized values, where $\epsilon$ is the machine epsilon.
  • $E$ (Exponent): Weights the value by a power of 2.

Common Floating-Point Formats

The following table summarizes the key characteristics of common floating-point formats:

| Format | Total Bits | Exponent Bits ($k$) | Fraction Bits ($n$) |
| ------ | ---------- | ------------------- | ------------------- |
| Double | 64         | 11                  | 52                  |
| Float  | 32         | 8                   | 23                  |
| FP16   | 16         | 5                   | 10                  |
| BF16   | 16         | 8                   | 7                   |

Special Value Categories

Floating-point numbers can represent various special values, defined by the exponent ($e$) and fraction ($f$) fields:

| Category            | Condition                  | Value                                       |
| ------------------- | -------------------------- | ------------------------------------------- |
| Normalized Values   | $0 < e < 2^k - 1$          | $(-1)^s \times (1 + f) \times 2^{e - bias}$ |
| Denormalized Values | $e = 0$                    | $(-1)^s \times f \times 2^{1 - bias}$       |
| Infinity            | $e = 2^k - 1$, $f = 0$     | $(-1)^s \times \infty$                      |
| NaN (Not a Number)  | $e = 2^k - 1$, $f \neq 0$  | NaN                                         |

Where the bias is $2^{k-1} - 1$.

Denormalized numbers serve two crucial purposes:

  1. Representation of Zero: They allow for distinct representations of positive ($+0.0$) and negative ($-0.0$) zero, differentiated by the sign bit.
  2. Representation of Values Close to Zero: They enable the representation of numbers very close to $0.0$, filling the gap between zero and the smallest normalized number (see the snippet below).
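For instance, in float32 the smallest positive denormalized and normalized values can be read directly from their bit patterns (a small sketch; the variable names are mine):

import struct

# 0x00000001: all-zero exponent, fraction = 1 -> smallest positive denormalized float32.
smallest_denormal = struct.unpack(">f", (0x00000001).to_bytes(4, "big"))[0]
# 0x00800000: exponent = 1, fraction = 0 -> smallest positive normalized float32.
smallest_normal = struct.unpack(">f", (0x00800000).to_bytes(4, "big"))[0]

print(smallest_denormal)  # 1.401298464324817e-45  (= 2 ** -149)
print(smallest_normal)    # 1.1754943508222875e-38 (= 2 ** -126)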

Example: 6-Bit Floating-Point Format

Let's illustrate with a 6-bit floating-point format (1 sign bit, 3 exponent bits, 2 fraction bits):

| Description          | Bit Representation | $E$  | $f$   | Value    |
| -------------------- | ------------------ | ---- | ----- | -------- |
| Zero                 | 0 000 00           | $-2$ | $0/4$ | $0$      |
| Smallest Positive    | 0 000 01           | $-2$ | $1/4$ | $1/16$   |
| Largest Denormalized | 0 000 11           | $-2$ | $3/4$ | $3/16$   |
| Smallest Normalized  | 0 001 00           | $-2$ | $0/4$ | $1/4$    |
| One                  | 0 011 00           | $0$  | $0/4$ | $1$      |
| Largest Normalized   | 0 110 11           | $3$  | $3/4$ | $14$     |
| Infinity             | 0 111 00           | -    | -     | $\infty$ |
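The table values can be reproduced with a small decoder that follows the formulas from the special-value table above (decode6 is a helper name of my own):

def decode6(bits):
    # Decode the 6-bit format: 1 sign bit, 3 exponent bits, 2 fraction bits.
    b = bits.replace(" ", "")
    s, e, f = int(b[0]), int(b[1:4], 2), int(b[4:6], 2)
    bias = 2 ** (3 - 1) - 1  # = 3
    if e == 0:               # denormalized
        val = (f / 4) * 2 ** (1 - bias)
    elif e == 2 ** 3 - 1:    # infinity or NaN
        val = float("inf") if f == 0 else float("nan")
    else:                    # normalized
        val = (1 + f / 4) * 2 ** (e - bias)
    return -val if s else val

print(decode6("0 000 01"))  # 0.0625 -> the smallest positive value (1/16)
print(decode6("0 001 00"))  # 0.25   -> the smallest normalized value
print(decode6("0 110 11"))  # 14.0   -> the largest normalized value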

Rounding Modes

When a number cannot be represented exactly, rounding is necessary. Common rounding modes include:

| Mode               | 1.4 | 1.6 | 1.5 | 2.5 | -1.5 |
| ------------------ | --- | --- | --- | --- | ---- |
| Round-to-Even      | 1   | 2   | 2   | 2   | -2   |
| Round-Toward-Zero  | 1   | 1   | 1   | 2   | -1   |
| Round-Down (Floor) | 1   | 1   | 1   | 2   | -2   |
| Round-Up (Ceiling) | 2   | 2   | 2   | 3   | -1   |

Floating-Point Operations: Precision and Pitfalls

A significant challenge with floating-point arithmetic is the "big eats small" phenomenon. Consider:

3.14 + 1 \times 10^{10} - 1 \times 10^{10} = 0.0

This occurs due to the following steps in floating-point addition:

  1. Alignment: Exponents are aligned by shifting the significand of the smaller number until both exponents match.
  2. Significand Addition: The significands are added.
  3. Normalization and Rounding: The result is normalized and rounded if necessary.

Precision loss happens during the alignment step when one number is significantly larger than the other.

Python Representation of Floating-Point Numbers

The following Python code demonstrates how to decompose a float into its significand and exponent:

import struct

def float_to_fe(f):
    packed = struct.pack('>f', f)
    int_val = struct.unpack('>I', packed)[0]
    sign = (int_val >> 31) & 1
    exponent = (int_val >> 23) & 0xFF
    mantissa = int_val & 0x7FFFFF

    if exponent == 0xFF:  # Infinity or NaN
        if mantissa == 0:
            return "Infinity" if sign == 0 else "-Infinity"
        else:
            return "NaN"

    bias = 127
    if exponent == 0:
        e = 1 - bias
        mantissa_binary = f"0.{mantissa:023b}"  # denormalized
    else:
        e = exponent - bias
        mantissa_binary = f"1.{mantissa:023b}"  # normalized

    if sign == 1:
        mantissa_binary = "-" + mantissa_binary

    return f"{mantissa_binary} * 2^{e}"

Example:

\begin{aligned} 3.14 &= 1.10010001111010111000011 \times 2^1 \\ 1 \times 10^{10} &= 1.00101010000001011111001 \times 2^{33} \end{aligned}

To align $3.14$ with $1 \times 10^{10}$, its significand must be right-shifted by 32 bits. Due to the 23-bit fraction field in single-precision floats, $3.14$ effectively becomes $0.0$.

Conversion Between Floating-Point Formats

Converting between floating-point formats can lead to:

  • Overflow: If the target format's exponent range is smaller.
  • Loss of Precision/Underflow: If the target format's fraction field is smaller.
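Both effects are easy to observe when downcasting float32 tensors, for example with PyTorch (a small sketch; the sample values are arbitrary):

import torch

x = torch.tensor([1e5, 3.14159], dtype=torch.float32)
# FP16 has only 5 exponent bits (max ~65504), so 1e5 overflows to inf.
print(x.to(torch.float16))
# BF16 keeps float32's 8 exponent bits, so 1e5 stays finite,
# but its 7 fraction bits round both values noticeably.
print(x.to(torch.bfloat16))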

OpenCore Linux DualBoot Setting

· 2 min read

As Apple and Homebrew ceased support for my 2015 MacBook Pro, I turned to OpenCore Patcher to keep it functional. However, many Python packages on PyPI no longer support Intel macOS, which led me to free approximately 400GB of disk space and install Linux Mint alongside macOS.

Upon installation, the OpenCore Patcher boot menu unexpectedly disappeared after rebooting into Linux Mint. This happens because Linux Mint placed its own bootloader first in the boot order. Each operating system, whether macOS or Linux, installs its own bootloader in the EFI partition. In my case, the EFI partition was mounted at /boot/efi under Linux Mint.

To resolve this, I followed the easiest method described in the OpenCore MultiBoot documentation:

  1. Backup and Modify the OpenCore Config File:

    • The original configuration file for OpenCore is located at /boot/efi/EFI/OC/config.plist.
    • Backup this file before making any changes.
    • Navigate to Misc -> BlessOverride within the config.plist editor.
    • Change the value from \EFI\Microsoft\Boot\bootmgfw.efi to \EFI\ubuntu\grubx64.efi.
  2. Restart and Access OpenCore Menu:

    • After making these changes, restart your computer while holding down the Option key.
    • From the boot menu, select "OpenCore" to access the EFI menu.
  3. Set OpenCore as Default Bootloader:

    • Once inside the OpenCore menu, OpenCore will automatically be set as the first entry in the boot order.
    • This means you won’t need to press Option every time you reboot your system.

By following these steps, I have successfully restored the functionality of my dual-boot setup, ensuring a seamless transition between macOS and Linux Mint while retaining control over which operating system boots by default.