Patching with Python's Import System

· 2 min read

In many scenarios, we need to apply patches to third-party libraries in Python. A common approach is "monkey patching." However, monkey patching is not a perfect solution because it dynamically changes attributes after a module has been imported. Sometimes the target module has already been imported and used before the patch is applied, so the monkey patch does not take effect as expected.

We need to find a way to modify modules as early as possible. A better method is to leverage Python's import system to achieve this. For detailed documentation on Python's import system, please refer to the official documentation. In short, Python imports a module in three steps:

  1. Search for the module using a Finder.
  2. Create the module using a Loader.
  3. Bind the module in the current namespace.

In step 1, we can hook into sys.meta_path to create a custom finder, which can return a different module specification (module spec) based on a given module name. In step 2, we can create a new loader for a specific module, which replaces certain attributes (functions, classes, variables) of the module before the created module is returned.

Therefore, with this approach, we can replace an entire module or its attributes when the module is first imported. Since sys.modules acts as a cache, each module is created only once. Consequently, after a module is modified, it will never change again, which is exactly what we expect.
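As a quick illustration of this caching behavior (a minimal sketch using the standard json module):

import json
import sys

assert sys.modules["json"] is json   # the module object created on first import lives in the cache
import json as json_again            # a second import simply returns the cached object
assert json_again is json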

Below is a complete example:

import importlib
import importlib.abc
import importlib.util
import sys

ALIAS_MAP = {"d": "patch.d"}
REPLACE_MAP = {"pkg.a.a_1": {"a_1_func1": ("patch.pkg.a.a_1", "a_1_func1_wrapper")}}


class AttrReplacingLoader(importlib.abc.Loader):
    """Wraps the original loader and replaces attributes after the module body has run."""

    def __init__(self, original_loader, replace_map):
        self.original_loader = original_loader
        self.replace_map = replace_map

    def create_module(self, spec):
        if hasattr(self.original_loader, "create_module"):
            return self.original_loader.create_module(spec)
        return None  # fall back to the default module creation

    def exec_module(self, module):
        self.original_loader.exec_module(module)
        # Wrap the configured attributes right after the module has been executed.
        to_replace = self.replace_map[module.__spec__.name]
        for attr_name, (new_module_name, new_attr_name) in to_replace.items():
            new_module = importlib.import_module(new_module_name)
            if hasattr(module, attr_name):
                setattr(module, attr_name, getattr(new_module, new_attr_name)(getattr(module, attr_name)))


class PatchModuleFinder(importlib.abc.MetaPathFinder):
    """Meta path finder that aliases modules and injects the attribute-replacing loader."""

    def __init__(self, alias_map, replace_map):
        self.alias_map = alias_map
        self.replace_map = replace_map

    def find_spec(self, fullname, path, target=None):
        # Alias: importing `fullname` actually loads the module it is mapped to.
        if fullname in self.alias_map:
            original_name = self.alias_map[fullname]
            original_spec = importlib.util.find_spec(original_name)
            if original_spec:
                return original_spec

        # Replacement: wrap the original loader so attributes are patched on first import.
        if fullname in self.replace_map:
            # Temporarily remove ourselves to avoid recursing while resolving the real spec.
            if self in sys.meta_path:
                sys.meta_path.remove(self)
            original_spec = None
            try:
                original_spec = importlib.util.find_spec(fullname, path)
            finally:
                if self not in sys.meta_path:
                    sys.meta_path.insert(0, self)

            if original_spec and original_spec.loader:
                return importlib.util.spec_from_loader(
                    fullname,
                    AttrReplacingLoader(original_spec.loader, self.replace_map),
                    origin=original_spec.origin,
                    is_package=original_spec.submodule_search_locations is not None,
                )
        return None

sys.meta_path.insert(0, PatchModuleFinder(ALIAS_MAP, REPLACE_MAP))
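To make this concrete, here is a sketch of what the patch package referenced by REPLACE_MAP might look like and how the patch takes effect on first import (the file path patch/pkg/a/a_1.py and the wrapper body are illustrative assumptions):

# patch/pkg/a/a_1.py (hypothetical patch module referenced by REPLACE_MAP)
def a_1_func1_wrapper(original_func):
    # Receives the original attribute and returns its replacement.
    def wrapped(*args, **kwargs):
        print("calling the patched a_1_func1")
        return original_func(*args, **kwargs)
    return wrapped


# Elsewhere, after the finder has been installed and before anything else imports pkg.a.a_1:
import pkg.a.a_1

pkg.a.a_1.a_1_func1()  # runs the wrapped version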

Training Paradigms and Frameworks for Reasoning Models

· 4 min read

Introduction: Reasoning Models and "Test Time Scaling"

As model size and training datasets continue to expand, the scaling laws traditionally followed by large language model training are gradually revealing limitations, yielding diminishing marginal returns. Concurrently, the inherent shortcomings of traditional training methods, such as inadequate understanding when tackling complex problems requiring deep reasoning, are becoming increasingly apparent.

Represented by research such as OpenAI's o1 model, a new type of "reasoning model" has emerged. A key characteristic of these models is their ability to dynamically adjust the computation time and resources needed for reasoning based on the complexity of the problem. This has led to a new scaling law known as "Test Time Scaling". This capability, dedicating varying depths of "thought" according to problem difficulty, is often compared to the "System 2 thinking" proposed by Daniel Kahneman in "Thinking, Fast and Slow," distinguishing it from the fast, intuitive, immediate responses ("System 1 thinking") of traditional large models. By leveraging this deep, step-by-step thinking ability, reasoning models hold the potential to solve more complex problems that were previously challenging for existing models.

Progress in Open Source Reasoning Models

In the open-source community, DeepSeek-R1 stands out as the first representative model employing such reasoning training techniques. Trained by combining rule-based reinforcement learning and the GRPO algorithm, this model achieved significant results and garnered widespread industry attention upon its release.

Since then, integrating reasoning or thought processes into training has become a major trend for mainstream open-source models. For instance, models like Llama 4, Qwen 3, and DeepSeek-Prover-V2 have all incorporated related reasoning-enhanced techniques into their training strategies. Furthermore, with the continuous iteration of similar models (such as DeepSeek-R2), it is foreseeable that reasoning models will become an important paradigm for large models to further elevate their capability ceilings.

Currently, the potential of reasoning models is far from fully realized. Related research remains at the academic forefront (see the Awesome-Inference-Time-Scaling list), with relevant papers continuously emerging, suggesting its potential to evolve into a new, significant model training paradigm.

Core Algorithms

From an algorithmic standpoint, current training for reasoning models primarily centers on Reinforcement Learning (RL) techniques. These methods are largely consistent with the algorithms used for human preference alignment during the post-training phase of traditional large models.

Mainstream algorithms include:

  • PPO
  • GRPO
  • Rule-based Reinforcement Learning

Simultaneously, the academic community is actively exploring and proposing new RL algorithms, such as RLOO and REINFORCE++. In the short term, RL algorithms for training reasoning models will likely continue to evolve and iterate rapidly rather than converge, which requires training frameworks to remain flexible and open.

Reinforcement Learning Training Frameworks

New training paradigms require the support of corresponding training frameworks. Unlike the dominance of frameworks like Megatron-LM in the traditional LLM pre-training domain, the current landscape for large-scale distributed reinforcement learning training frameworks is diverse. Here are several currently popular or noteworthy RL training frameworks:

  • verl (GitHub)
    • Developer: ByteDance
    • Features: Built on Ray, supports integration with mainstream training/inference systems like FSDP, Megatron-LM, vLLM. Designed for easy extension with new RL algorithms and offers good performance.
    • License: Apache License
    • Popularity: GitHub ~7.6k stars
  • OpenRLHF (GitHub)
    • Developer: Open Source Community
    • Features: Built on Ray, integrates DeepSpeed and vLLM, supports multiple RL algorithms.
    • License: Apache-2.0 License
    • Popularity: GitHub ~6.6k stars
  • TRL (GitHub)
    • Developer: Hugging Face
    • Features: Can utilize Accelerate to integrate DeepSpeed for acceleration. Comparatively, less deeply integrated with dedicated inference frameworks, more focused on research and experimental scenarios.
    • License: Apache-2.0 License
    • Popularity: GitHub ~13.6k stars
  • DeepSpeed Chat (GitHub)
    • Developer: Microsoft
    • Features: Implemented based on the DeepSpeed training framework and inference engine.
    • License: Apache-2.0 License
    • Popularity: GitHub ~38.2k stars (Note: star count reflects the entire DeepSpeed project)
  • Nemo-Aligner (GitHub)
    • Developer: NVIDIA
    • Features: Implemented based on Megatron-LM and TensorRT-LLM. Community activity is relatively low at present.
    • License: Apache-2.0 License
    • Popularity: GitHub ~0.7k stars

Framework Trend Analysis: From a technical standpoint, frameworks like verl, which leverage Ray for distributed scheduling while integrating mainstream distributed training libraries (like Megatron-LM) and efficient inference engines (like vLLM), represent a highly promising approach for large-scale model reinforcement learning. This is because they can effectively reuse mature components and adaptation experiences from the existing large model ecosystem.

Unveiling the RoPE Implementations of Llama and DeepSeek

· 9 min read

Almost all open-source models use RoPE (Rotary Position Embedding) based on the same theory from RoFormer: Enhanced Transformer with Rotary Position Embedding. However, there are two ways to implement RoPE: the GPT-J style and the GPT-NeoX style.

GPT-J and GPT-NeoX RoPE Implementations

The GPT-J style is identical to the original RoFormer, using an interleaved method to calculate RoPE. The GPT-NeoX style uses an alternative, non-interleaved method. According to the Eleuther AI blog, they considered the original implementation inefficient and thus improved it by splitting the dimension into two halves (non-interleaved). Note that the GPT-NeoX and GPT-J styles produce different results.

The GPT-NeoX style RoPE calculation is as follows:

import torch

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, max_seq_len):
        seq = torch.arange(max_seq_len, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        freqs = torch.outer(seq, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = emb.cos()[:, None, None, :]
        sin = emb.sin()[:, None, None, :]
        return cos, sin


def _rotate_half(x):
    x1, x2 = torch.chunk(x, 2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(t, cos, sin):
    return t * cos + _rotate_half(t) * sin

The GPT-J style has two ways to implement RoPE, with the complex number method being more intuitive.

Complex number method:

import torch

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, max_seq_len):
        seq = torch.arange(max_seq_len, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        freqs = torch.outer(seq, self.inv_freq)
        freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
        return freqs_cis[:, None, None, :]


def apply_rotary_pos_emb(t, freqs_cis):
    t_ = torch.view_as_complex(t.float().reshape(*t.shape[:-1], -1, 2))
    rotated_t_complex = t_ * freqs_cis
    return torch.view_as_real(rotated_t_complex).flatten(3)

Traditional method:

import torch

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, max_seq_len):
        seq = torch.arange(max_seq_len, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        freqs = torch.outer(seq, self.inv_freq)
        emb = torch.stack((freqs, freqs), dim=-1).flatten(start_dim=1)
        cos = emb.cos()[:, None, None, :]
        sin = emb.sin()[:, None, None, :]
        return cos, sin


def _rotate_every_two(x):
    x1 = x[:, :, :, ::2]
    x2 = x[:, :, :, 1::2]
    x_new = torch.stack((-x2, x1), dim=-1)
    return x_new.flatten(start_dim=3)


def apply_rotary_pos_emb(t, cos, sin):
    return t * cos + _rotate_every_two(t) * sin

Llama and DeepSeek RoPE Implementations

Due to the significant influence of the Hugging Face community, many people believe Llama uses the GPT-NeoX style based on its inference code in the transformers library. However, this is not the case. In Llama's original code, it implements the GPT-J style RoPE using the complex number method. So, why the difference between the two codebases? The answer lies in this issue. In the weight conversion script, they permuted the weights of q_proj and k_proj.

def permute(w, n_heads=n_heads, dim1=dim, dim2=dim):
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

It's not immediately obvious why this works. We will come back to explain everything later.

A similar situation occurred with DeepSeek-V3. In the original code for deepseek-v3, it uses the same complex number method as Llama to compute GPT-J style RoPE (in fact, their code is very similar). Again, in the Hugging Face code for deepseek-v3, it uses a style similar to GPT-NeoX, just like Llama (again, the code is very similar), but with an exception on lines 364 and 367. Yes, it's very similar to the permute function mentioned above.

q = q.view(b, s, h, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
k = k.view(b, s, h, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)

So, what is the difference between these two implementations? Many people have the same question, as seen in this discussion.

Since RoPE acts only on the last dimension, a simple example helps us understand why.

In the GPT-NeoX style, with a head dimension of 6, we have:

\begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \\ d_4 \\ d_5 \end{bmatrix} \cdot \begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_0 \\ c_1 \\ c_2 \end{bmatrix} + \begin{bmatrix} -d_3 \\ -d_4 \\ -d_5 \\ d_0 \\ d_1 \\ d_2 \end{bmatrix} \cdot \begin{bmatrix} s_0 \\ s_1 \\ s_2 \\ s_0 \\ s_1 \\ s_2 \\ \end{bmatrix} = \begin{bmatrix} d_0c_0 - d_3s_0 \\ d_1c_1 - d_4s_1 \\ d_2c_2 - d_5s_2 \\ d_3c_0 + d_0s_0 \\ d_4c_1 + d_1s_1 \\ d_5c_2 + d_2s_2 \end{bmatrix}

In the GPT-J style (interleaved), we have:

\begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \\ d_4 \\ d_5 \\ \end{bmatrix} \cdot \begin{bmatrix} c_0 \\ c_0 \\ c_1 \\ c_1 \\ c_2 \\ c_2 \\ \end{bmatrix} + \begin{bmatrix} -d_1 \\ d_0 \\ -d_3 \\ d_2 \\ -d_5 \\ d_4 \\ \end{bmatrix} \cdot \begin{bmatrix} s_0 \\ s_0 \\ s_1 \\ s_1 \\ s_2 \\ s_2 \\ \end{bmatrix} = \begin{bmatrix} d_0c_0 - d_1s_0 \\ d_1c_0 + d_0s_0 \\ d_2c_1 - d_3s_1 \\ d_3c_1 + d_2s_1 \\ d_4c_2 - d_5s_2 \\ d_5c_2 + d_4s_2 \\ \end{bmatrix}

Now, let's look at the DeepSeek style (as implemented in Hugging Face, applying the permutation from the transpose operation before applying the NeoX-style rotation):

First, permute the input vector $d$:

d_{permuted} = \begin{bmatrix} d_0 \\ d_2 \\ d_4 \\ d_1 \\ d_3 \\ d_5 \\ \end{bmatrix}

Then apply the NeoX-style rotation logic to $d_{permuted}$:

\begin{bmatrix} d_0 \\ d_2 \\ d_4 \\ d_1 \\ d_3 \\ d_5 \\ \end{bmatrix} \cdot \begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_0 \\ c_1 \\ c_2 \\ \end{bmatrix} + \begin{bmatrix} -d_1 \\ -d_3 \\ -d_5 \\ d_0 \\ d_2 \\ d_4 \\ \end{bmatrix} \cdot \begin{bmatrix} s_0 \\ s_1 \\ s_2 \\ s_0 \\ s_1 \\ s_2 \\ \end{bmatrix} = \begin{bmatrix} d_0c_0 - d_1s_0 \\ d_2c_1 - d_3s_1 \\ d_4c_2 - d_5s_2 \\ d_1c_0 + d_0s_0 \\ d_3c_1 + d_2s_1 \\ d_5c_2 + d_4s_2 \\ \end{bmatrix}

We find that the DeepSeek style (permute then apply NeoX RoPE) simply produces a permuted version of the GPT-J style result. Specifically, the resulting vector is $[r_0, r_2, r_4, r_1, r_3, r_5]$, where $r$ is the result vector from the GPT-J style calculation.

Recall how attention is calculated, using the dot product of q and k:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

When q and k are permuted in the same way, their dot product remains unchanged. Let $P$ be the permutation matrix; then $(Pq)^T(Pk) = q^T P^T P k$. Since $P$ is a permutation matrix, $P^T P = I$ (the identity matrix), so $q^T P^T P k = q^T I k = q^T k$. The dot product result is the same.

So the DeepSeek style (in Hugging Face) is actually equivalent to the GPT-J style in terms of the final attention scores.
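As a sanity check, here is a small numerical sketch (the helpers rope_gptj, rope_neox, and permute below are my own, not taken from any of the codebases above) verifying that permuting q and k and then applying NeoX-style RoPE gives the same scores as applying GPT-J style RoPE directly:

import torch

torch.manual_seed(0)
d = 8                     # head dimension (even)
m, n = 3, 7               # two token positions
inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))

def rope_gptj(x, pos):
    # Interleaved: pairs are (x0, x1), (x2, x3), ...
    cos = torch.repeat_interleave(torch.cos(pos * inv_freq), 2)
    sin = torch.repeat_interleave(torch.sin(pos * inv_freq), 2)
    rotated = torch.stack((-x[1::2], x[0::2]), dim=-1).flatten()
    return x * cos + rotated * sin

def rope_neox(x, pos):
    # Non-interleaved: pairs are (x0, x_{d/2}), (x1, x_{d/2+1}), ...
    cos = torch.cat((torch.cos(pos * inv_freq),) * 2)
    sin = torch.cat((torch.sin(pos * inv_freq),) * 2)
    rotated = torch.cat((-x[d // 2:], x[: d // 2]))
    return x * cos + rotated * sin

def permute(x):
    # [x0, x1, x2, ...] -> [x0, x2, x4, ..., x1, x3, x5, ...]
    return torch.cat((x[0::2], x[1::2]))

q, k = torch.randn(d), torch.randn(d)
score_gptj = rope_gptj(q, m) @ rope_gptj(k, n)
score_deepseek_hf = rope_neox(permute(q), m) @ rope_neox(permute(k), n)
print(torch.allclose(score_gptj, score_deepseek_hf))  # True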

Now we can return to the Llama code in transformers. Permuting the qw and kw weights beforehand has the same effect as permuting the resulting q and k vectors after the matrix multiplication but before applying RoPE.

In the regular approach (no weight permutation), applying the linear layer:

\mathbf{q}^T = \mathbf{x}^T \times \mathbf{W_q} = \mathbf{x}^T \times \begin{bmatrix} \mathbf{w_0} & \mathbf{w_1} & \mathbf{w_2} & \mathbf{w_3} & \mathbf{w_4} & \mathbf{w_5} \end{bmatrix} = \begin{bmatrix} d_0 & d_1 & d_2 & d_3 & d_4 & d_5 \end{bmatrix}

where $d_i = \mathbf{x} \cdot \mathbf{w_i}$.

In the permuted weight approach:

\mathbf{q}_{permuted\_weights}^T = \mathbf{x}^T \times \mathbf{W_q}_{permuted} = \mathbf{x}^T \times \begin{bmatrix} \mathbf{w_0} & \mathbf{w_2} & \mathbf{w_4} & \mathbf{w_1} & \mathbf{w_3} & \mathbf{w_5} \end{bmatrix} = \begin{bmatrix} d_0 & d_2 & d_4 & d_1 & d_3 & d_5 \end{bmatrix}

Applying the linear layer with permuted weights yields a result vector q that is already permuted in the same way as the DeepSeek style's explicit permutation of q. Then, the Hugging Face Llama code applies the NeoX-style RoPE to this already-permuted vector, which, as shown above, is equivalent to applying the GPT-J style RoPE to the original, unpermuted q.
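A quick numerical check of this claim (a toy sketch using the column convention from the equations above; note that nn.Linear stores the transposed weight, which is why the actual conversion script permutes rows rather than columns):

import torch

torch.manual_seed(0)
d = 6
x = torch.randn(d)
W = torch.randn(d, d)                    # toy projection weight, columns play the role of w_i
perm = torch.tensor([0, 2, 4, 1, 3, 5])  # [even dims, odd dims], matching the permutation above

permute_after = (x @ W)[perm]            # project, then permute q
permute_weights = x @ W[:, perm]         # permute the weight columns, then project
print(torch.allclose(permute_after, permute_weights))  # True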

Now, we understand the complete picture:

Llama used GPT-J style RoPE during training. However, when converting its original weights to the Hugging Face format, it permuted the qw and kw weights and used a GPT-NeoX-like style for inference (for performance reasons).

DeepSeek also used GPT-J style RoPE during training, but its weight conversion did not permute the qw and kw weights. Therefore, the permutation of q and k had to be added inside the transformers inference code (so it gains no performance benefit from using the NeoX-style RoPE calculation itself).

One last question remains: if the GPT-NeoX style RoPE calculation is more performant, why do most open-source models still use the GPT-J style RoPE during training? The answer might be related to long context window extension. I haven't delved deeply into this issue yet.

Adverse Effects on Other AI Frameworks

Now that we understand the RoPE situation for Llama and DeepSeek, unfortunately, the story is far from over. Many AI frameworks copied the Hugging Face code for DeepSeek, leading to unnecessary complexity in their training code.

For example, in the Megatron-LM code:

if self.multi_latent_attention and self.rotary_interleaved:
    raise ValueError("rotary_interleaved does not work with multi_latent_attention.")

It prohibits the use of interleaved RoPE and specifically handles DeepSeek in its RoPE code, mimicking the Hugging Face code:

if multi_latent_attention:
    x1 = t[..., 0::2]
    x2 = t[..., 1::2]
    t = torch.cat((x1, x2), dim=-1)

In PaddlePaddle, it also handles DeepSeek RoPE in the same way:

b, s, h, d = q.shape
q = q.reshape([b, s, h, d // 2, 2]).transpose([0, 1, 2, 4, 3]).reshape([b, s, h, d])

b, s, h, d = k.shape
k = k.reshape([b, s, h, d // 2, 2]).transpose([0, 1, 2, 4, 3]).reshape([b, s, h, d])

Lessons Learned

So, what can we learn from this experience?

  1. Trust different sources according to the following priority: Paper > Original implementation code > Hugging Face code > Third-party frameworks.
  2. Solutions from the open-source community might not be the optimal implementation; think before you act (or copy).
  3. Pay special attention to the differences between inference code and training code when converting between them.

Rotary Positional Embedding

· 5 min read

RoPE (Rotary Positional Embedding) is a type of relative positional encoding. Currently, most mainstream large models use RoPE or its variants. The original paper on RoPE can be found in RoFormer: Enhanced Transformer with Rotary Position Embedding.

Since the self-attention computation is position-independent, positional encoding has been added since the invention of the Transformer to capture dependencies between different positions. The original Transformer uses absolute (sinusoidal) positional encoding:

\begin{aligned} p_{k, 2t} &= \sin(k / 10000^{2t / d}) \\ p_{k, 2t+1} &= \cos(k / 10000^{2t / d}) \end{aligned}

Because absolute positional encoding is added directly to the token embedding, it cannot directly model the relative positions between tokens, and inference quality drops sharply on sequences longer than those seen during training. Relative positional encoding was introduced to correct this problem, and RoPE has become the mainstream approach.

Introduction to RoPE

The core idea of RoPE is to find a positional encoding function such that the following equation holds:

\langle f(\mathbf{q}_m, m), f(\mathbf{k}_n, n) \rangle = g(\mathbf{q}_m, \mathbf{k}_n, m - n)

That is, when calculating the dot product of $q$ and $k$ during attention, the result is independent of the absolute positions $m$ and $n$ of the tokens, and depends only on the relative position $m - n$.

When the embedding dimension $d$ is only 2, the following formulas precisely satisfy the above property:

\begin{aligned} f(\mathbf{q}_m, m) &= (\mathbf{q}_m)e^{im\theta} \\ f(\mathbf{k}_n, n) &= (\mathbf{k}_n)e^{in\theta} \\ g(\mathbf{q}_m, \mathbf{k}_n, m - n) &= \text{Re}[(\mathbf{q}_m)(\mathbf{k}_n)^* e^{i(m-n)\theta}] \end{aligned}

The proof of the above equation involves the application of complex exponential functions, mainly relying on the following three properties:

\begin{aligned} \langle z_1, z_2 \rangle &= \text{Re}[z_1 z_2^*] \\ (z_1z_2)^* &= z_1^* z_2^* \\ (e^{i\phi})^* &= e^{-i\phi} \end{aligned}

Using the above three properties, the RoPE formulas above can be derived easily; the steps are spelled out below.
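Treating $\mathbf{q}_m$ and $\mathbf{k}_n$ as complex numbers, the derivation takes only a few lines (added here for completeness):

\begin{aligned} \langle f(\mathbf{q}_m, m), f(\mathbf{k}_n, n) \rangle &= \text{Re}\left[ (\mathbf{q}_m e^{im\theta}) \left( \mathbf{k}_n e^{in\theta} \right)^* \right] \\ &= \text{Re}\left[ \mathbf{q}_m (\mathbf{k}_n)^* e^{im\theta} e^{-in\theta} \right] \\ &= \text{Re}\left[ \mathbf{q}_m (\mathbf{k}_n)^* e^{i(m-n)\theta} \right] \\ &= g(\mathbf{q}_m, \mathbf{k}_n, m - n) \end{aligned}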

According to Euler's formula:

e^{i\phi} = \cos(\phi) + i \sin(\phi)

Expanding this, we can obtain $f$, which is consistent for both $q$ and $k$:

f(\boldsymbol{q}_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} q_m^{(1)} \\ q_m^{(2)} \end{pmatrix}

Actually, we can understand the above RoPE formula from another, more intuitive perspective. The geometric meaning of the dot product of two-dimensional vectors is the product of their lengths multiplied by the cosine of the angle between them. The RoPE positional encoding function above is equivalent to rotating the vector while keeping its length unchanged. Therefore, the dot product of two rotated vectors depends only on the relative rotation angle and is independent of the absolute angles.

Once the $d = 2$ case is understood, it becomes relatively easy to understand the case where $d$ is any even number. The embedding dimensions are divided into pairs, and a different $\theta_i = 10000^{-2(i-1)/d}, i \in [1, 2, ..., d/2]$ is applied to each pair, resulting in the complete RoPE positional encoding function:

\mathbf{R}_{\boldsymbol{\Theta}, m}^{d} = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\ 0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\ 0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2} \end{pmatrix}

The core concept of RoPE is to rotate the embedding vectors based on the token's position. This is achieved by applying a rotation matrix to the token's embedding, where the rotation angle is determined by the token's position in the sequence. By rotating the embeddings instead of using fixed position encodings, the model can maintain more flexible and continuous position information.
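As a quick numerical check of this relative-position property (a minimal sketch for the $d = 2$ case; the helper name rotate and the value of theta are arbitrary choices of mine):

import math

import torch

torch.manual_seed(0)
theta = 0.3  # an arbitrary rotation frequency

def rotate(x, pos):
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    R = torch.tensor([[c, -s], [s, c]])  # the 2x2 rotation matrix from the formula above
    return R @ x

q, k = torch.randn(2), torch.randn(2)
a = rotate(q, 5) @ rotate(k, 2)  # positions (5, 2), relative offset 3
b = rotate(q, 9) @ rotate(k, 6)  # positions (9, 6), same relative offset 3
print(torch.allclose(a, b))      # True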

Introduction to YaRN

Although RoPE applies relative positional encoding, it is still limited in generalizing beyond the context window seen during training. Several extension methods have been proposed; YaRN is the most popular one, offering a good balance between performance and complexity. The original paper can be found in YaRN: Efficient Context Window Extension of Large Language Models.

\theta_i^{new} = \left[ \gamma_i + (1 - \gamma_i) \frac{L}{L'} \right] \theta_i, \quad \gamma_i = \begin{cases} 1, & r_i > \tau \\ 0, & r_i < 1 \\ \frac{r_i - 1}{\tau - 1}, & \text{others} \end{cases}

The key point is that when $\theta_i$ is small, the rotation is slow; the number of full rotations over the training context length $L$ is $r_i = \frac{\theta_i L}{2\pi}$. If $r_i < 1$, that dimension did not complete a full cycle during training, so its $\theta_i$ should be interpolated. If $r_i > \tau$, it can safely be extrapolated. A linear ramp is applied between the two conditions.

YaRN also adds a scaling factor to the attention softmax:

\lambda = \left( 1 + 0.1 \ln \frac{L'}{L} \right)^2

This is simply an empirical value without theoretical support.
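A minimal sketch of the frequency adjustment described above (the function name yarn_inv_freq and the default values of L, L_new, and tau are illustrative assumptions; dim and base follow the RoPE code in the previous post):

import math

import torch

def yarn_inv_freq(dim, base=10000.0, L=4096, L_new=32768, tau=32):
    # Original RoPE frequencies theta_i.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Number of full rotations each dimension completes over the training length L.
    r = inv_freq * L / (2 * math.pi)
    # gamma = 0 -> interpolate (scale by L / L_new), gamma = 1 -> extrapolate (keep theta_i).
    gamma = torch.clamp((r - 1) / (tau - 1), 0.0, 1.0)
    return (gamma + (1 - gamma) * L / L_new) * inv_freq

def yarn_softmax_scale(L=4096, L_new=32768):
    # The empirical attention-logit scale from the formula above.
    return (1 + 0.1 * math.log(L_new / L)) ** 2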

Open Source Implementation

Transformer-Engine

class FusedRoPEFunc(torch.autograd.Function):
    def forward(
        ctx,
        t: torch.Tensor,
        freqs: torch.Tensor,
        tensor_format: str = "sbhd",
        interleaved: bool = False,
        cu_seqlens: Union[torch.Tensor, None] = None,
        cp_size: int = 1,
        cp_rank: int = 0,
    ) -> torch.Tensor:
        ...

    def backward(ctx, grad_output: torch.Tensor) -> Tuple[Union[torch.Tensor, None], ...]:
        ...

Flash-Attention

class ApplyRotaryEmb(torch.autograd.Function):
    def forward(
        ctx,
        x,
        cos,
        sin,
        interleaved=False,
        inplace=False,
        seqlen_offsets: Union[int, torch.Tensor] = 0,
        cu_seqlens: Optional[torch.Tensor] = None,
        max_seqlen: Optional[int] = None,
    ):
        ...

    def backward(ctx, do):
        ...

Troubleshooting Send to Kindle Conversion Failures

· One min read

Occasionally, users encounter difficulties when attempting to send EPUB files to their Kindle devices using Amazon's "Send to Kindle" service. The underlying causes for these conversion failures remain elusive, presenting a challenge for consistent troubleshooting.

However, a practical workaround often proves effective:

Intermediate AZW3 Conversion:

  • Convert the problematic EPUB file to the AZW3 format, which is a Kindle-native format.
  • Subsequently, convert the AZW3 file back to EPUB.

This two-step conversion process can effectively address certain compatibility issues that may be hindering the initial "Send to Kindle" conversion. By introducing this intermediate format, the file undergoes a re-processing that often resolves underlying formatting or encoding conflicts.

While this workaround is frequently successful, it's important to note that the specific reasons for the original conversion failures can vary. Amazon's conversion algorithms and the inherent complexities of EPUB formatting contribute to this variability.

Floating-Point Representation: A Deep Dive

· 5 min read

IEEE 754 Floating-Point Standard

The IEEE 754 standard defines how floating-point numbers are represented in computers. A number, $V$, is expressed as:

V = (-1)^s \times M \times 2^E

Where:

  • $s$ (Sign): Determines the sign of the number: $s = 0$ for positive, $s = 1$ for negative.
  • $M$ (Significand/Mantissa): A fractional binary number. It ranges from $1$ to $2 - \epsilon$ for normalized values, or from $0$ to $1 - \epsilon$ for denormalized values, where $\epsilon$ is the machine epsilon.
  • $E$ (Exponent): Weights the value by a power of 2.

Common Floating-Point Formats

The following table summarizes the key characteristics of common floating-point formats:

| Format | Total Bits | Exponent Bits ($k$) | Fraction Bits ($n$) |
| ------ | ---------- | ------------------- | ------------------- |
| Double | 64         | 11                  | 52                  |
| Float  | 32         | 8                   | 23                  |
| FP16   | 16         | 5                   | 10                  |
| BF16   | 16         | 8                   | 7                   |

Special Value Categories

Floating-point numbers can represent various special values, defined by the exponent ($e$) and fraction ($f$) fields:

| Category            | Condition                  | Value                                       |
| ------------------- | -------------------------- | ------------------------------------------- |
| Normalized Values   | $0 < e < 2^k - 1$          | $(-1)^s \times (1 + f) \times 2^{e - bias}$ |
| Denormalized Values | $e = 0$                    | $(-1)^s \times f \times 2^{1 - bias}$       |
| Infinity            | $e = 2^k - 1$, $f = 0$     | $(-1)^s \times \infty$                      |
| NaN (Not a Number)  | $e = 2^k - 1$, $f \neq 0$  | NaN                                         |

Where the bias is $2^{k-1} - 1$.

Denormalized numbers serve two crucial purposes:

  1. Representation of Zero: They allow for distinct representations of positive ($+0.0$) and negative ($-0.0$) zero, differentiated by the sign bit.
  2. Representation of Values Close to Zero: They enable the representation of numbers very close to $0.0$, filling the gap between zero and the smallest normalized number (see the snippet below).
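For instance, in float32 the smallest positive denormalized and normalized values can be read directly from their bit patterns (a small sketch; the variable names are mine):

import struct

# 0x00000001: all-zero exponent, fraction = 1 -> smallest positive denormalized float32.
smallest_denormal = struct.unpack(">f", (0x00000001).to_bytes(4, "big"))[0]
# 0x00800000: exponent = 1, fraction = 0 -> smallest positive normalized float32.
smallest_normal = struct.unpack(">f", (0x00800000).to_bytes(4, "big"))[0]

print(smallest_denormal)  # 1.401298464324817e-45  (= 2 ** -149)
print(smallest_normal)    # 1.1754943508222875e-38 (= 2 ** -126)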

Example: 6-Bit Floating-Point Format

Let's illustrate with a 6-bit floating-point format (1 sign bit, 3 exponent bits, 2 fraction bits):

| Description          | Bit Representation | $E$  | $f$   | Value    |
| -------------------- | ------------------ | ---- | ----- | -------- |
| Zero                 | 0 000 00           | $-2$ | $0/4$ | $0$      |
| Smallest Positive    | 0 000 01           | $-2$ | $1/4$ | $1/16$   |
| Largest Denormalized | 0 000 11           | $-2$ | $3/4$ | $3/16$   |
| Smallest Normalized  | 0 001 00           | $-2$ | $0/4$ | $1/4$    |
| One                  | 0 011 00           | $0$  | $0/4$ | $1$      |
| Largest Normalized   | 0 110 11           | $3$  | $3/4$ | $14$     |
| Infinity             | 0 111 00           | -    | -     | $\infty$ |
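The table values can be reproduced with a small decoder that follows the formulas from the special-value table above (decode6 is a helper name of my own):

def decode6(bits):
    # Decode the 6-bit format: 1 sign bit, 3 exponent bits, 2 fraction bits.
    b = bits.replace(" ", "")
    s, e, f = int(b[0]), int(b[1:4], 2), int(b[4:6], 2)
    bias = 2 ** (3 - 1) - 1  # = 3
    if e == 0:               # denormalized
        val = (f / 4) * 2 ** (1 - bias)
    elif e == 2 ** 3 - 1:    # infinity or NaN
        val = float("inf") if f == 0 else float("nan")
    else:                    # normalized
        val = (1 + f / 4) * 2 ** (e - bias)
    return -val if s else val

print(decode6("0 000 01"))  # 0.0625 -> the smallest positive value (1/16)
print(decode6("0 001 00"))  # 0.25   -> the smallest normalized value
print(decode6("0 110 11"))  # 14.0   -> the largest normalized value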

Rounding Modes

When a number cannot be represented exactly, rounding is necessary. Common rounding modes include:

| Mode               | 1.4 | 1.6 | 1.5 | 2.5 | -1.5 |
| ------------------ | --- | --- | --- | --- | ---- |
| Round-to-Even      | 1   | 2   | 2   | 2   | -2   |
| Round-Toward-Zero  | 1   | 1   | 1   | 2   | -1   |
| Round-Down (Floor) | 1   | 1   | 1   | 2   | -2   |
| Round-Up (Ceiling) | 2   | 2   | 2   | 3   | -1   |

Floating-Point Operations: Precision and Pitfalls

A significant challenge with floating-point arithmetic is the "big eats small" phenomenon. Consider:

3.14 + 1 \times 10^{10} - 1 \times 10^{10} = 0.0

This occurs due to the following steps in floating-point addition:

  1. Alignment: Exponents are aligned by shifting the significand of the smaller number until both exponents match.
  2. Significand Addition: The significands are added.
  3. Normalization and Rounding: The result is normalized and rounded if necessary.

Precision loss happens during the alignment step when one number is significantly larger than the other.

Python Representation of Floating-Point Numbers

The following Python code demonstrates how to decompose a float into its significand and exponent:

import struct

def float_to_fe(f):
    packed = struct.pack('>f', f)
    int_val = struct.unpack('>I', packed)[0]
    sign = (int_val >> 31) & 1
    exponent = (int_val >> 23) & 0xFF
    mantissa = int_val & 0x7FFFFF

    if exponent == 0xFF:  # Infinity or NaN
        if mantissa == 0:
            return "Infinity" if sign == 0 else "-Infinity"
        else:
            return "NaN"

    bias = 127
    if exponent == 0:
        e = 1 - bias
        mantissa_binary = f"0.{mantissa:023b}"  # denormalized
    else:
        e = exponent - bias
        mantissa_binary = f"1.{mantissa:023b}"  # normalized

    if sign == 1:
        mantissa_binary = "-" + mantissa_binary

    return f"{mantissa_binary} * 2^{e}"

Example:

\begin{aligned} 3.14 &= 1.10010001111010111000011 \times 2^1 \\ 1 \times 10^{10} &= 1.00101010000001011111001 \times 2^{33} \end{aligned}

To align $3.14$ with $1 \times 10^{10}$, its significand must be right-shifted by 32 bits. Due to the 23-bit fraction field in single-precision floats, $3.14$ effectively becomes $0.0$.

Conversion Between Floating-Point Formats

Converting between floating-point formats can lead to:

  • Overflow: If the target format's exponent range is smaller.
  • Loss of Precision/Underflow: If the target format's fraction field is smaller.
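Both effects are easy to observe when downcasting float32 tensors, for example with PyTorch (a small sketch; the sample values are arbitrary):

import torch

x = torch.tensor([1e5, 3.14159], dtype=torch.float32)
# FP16 has only 5 exponent bits (max ~65504), so 1e5 overflows to inf.
print(x.to(torch.float16))
# BF16 keeps float32's 8 exponent bits, so 1e5 stays finite,
# but its 7 fraction bits round both values noticeably.
print(x.to(torch.bfloat16))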

OpenCore Linux DualBoot Setting

· 2 min read

As Apple and Homebrew ceased support for my 2015 MacBook Pro, I turned to OpenCore Patcher to keep it functional. However, many Python packages on PyPI no longer support Intel macOS, which led me to free approximately 400GB of disk space and install Linux Mint alongside macOS.

Upon installation, the OpenCore Patcher boot menu unexpectedly disappeared after rebooting into Linux Mint. This happens because Linux Mint placed its own bootloader first in the boot order. Each operating system, whether macOS or Linux, installs its own bootloader in the EFI partition. In my case, the EFI partition was mounted at /boot/efi under Linux Mint.

To resolve this, I followed the easiest method described in the OpenCore MultiBoot documentation:

  1. Backup and Modify the OpenCore Config File:

    • The original configuration file for OpenCore is located at /boot/efi/EFI/OC/config.plist.
    • Backup this file before making any changes.
    • Navigate to Misc -> BlessOverride within the config.plist editor.
    • Change the value from \EFI\Microsoft\Boot\bootmgfw.efi to \EFI\ubuntu\grubx64.efi.
  2. Restart and Access OpenCore Menu:

    • After making these changes, restart your computer while holding down the Option key.
    • From the boot menu, select "OpenCore" to access the EFI menu.
  3. Set OpenCore as Default Bootloader:

    • Once inside the OpenCore menu, OpenCore will automatically be set as the first entry in the boot order.
    • This means you won’t need to press Option every time you reboot your system.

By following these steps, I have successfully restored the functionality of my dual-boot setup, ensuring a seamless transition between macOS and Linux Mint while retaining control over which operating system boots by default.