In many scenarios, we need to apply patches to third-party libraries in Python. A common approach is "monkey patching." However, monkey patching is not a perfect solution, because it dynamically changes attributes only after a module has been imported: other code may already have imported the module and bound references to the original attributes before the patch takes effect, so the monkey patch does not work as expected.
We need to find a way to modify modules as early as possible. A better method is to leverage Python's import system to achieve this. For detailed documentation on Python's import system, please refer to the official documentation. In short, Python imports a module in three steps:
Search for the module using a Finder.
Create the module using a Loader.
Bind the module in the current namespace.
In step 1, we can hook into sys.meta_path to create a custom finder, which can return a different module specification (module spec) based on a given module name. In step 2, we can create a new loader for a specific module, which replaces certain attributes (functions, classes, variables) of the module before the created module is returned.
Therefore, with this approach, we can replace an entire module or its attributes when the module is first imported. Since sys.modules acts as a cache, each module is created only once. Consequently, after a module is modified, it will never change again, which is exactly what we expect.
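As a concrete illustration, here is a minimal sketch of such a finder/loader pair (my own example; some_library and the patch mapping are hypothetical placeholders):

import sys
import importlib.abc
import importlib.util


class PatchingLoader(importlib.abc.Loader):
    # Wraps the original loader and patches attributes right after the module body runs.
    def __init__(self, original_loader, patches):
        self._original_loader = original_loader
        self._patches = patches

    def create_module(self, spec):
        return self._original_loader.create_module(spec)

    def exec_module(self, module):
        self._original_loader.exec_module(module)
        # Replace attributes before any other code can see the module.
        for name, value in self._patches.items():
            setattr(module, name, value)


class PatchingFinder(importlib.abc.MetaPathFinder):
    # Intercepts the import of one module and returns a spec that uses the patching loader.
    def __init__(self, module_name, patches):
        self._module_name = module_name
        self._patches = patches

    def find_spec(self, fullname, path, target=None):
        if fullname != self._module_name:
            return None  # let the normal finders handle every other module
        # Temporarily remove ourselves so importlib.util.find_spec does not recurse into us.
        sys.meta_path.remove(self)
        try:
            spec = importlib.util.find_spec(fullname)
        finally:
            sys.meta_path.insert(0, self)
        if spec is None or spec.loader is None:
            return None
        spec.loader = PatchingLoader(spec.loader, self._patches)
        return spec


# Hypothetical usage: install the finder before anything imports some_library.
# sys.meta_path.insert(0, PatchingFinder("some_library", {"helper": patched_helper}))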
Introduction: Reasoning Models and "Test Time Scaling"
As model size and training datasets continue to expand, the scaling laws traditionally followed by large language model training are gradually revealing limitations, yielding diminishing marginal returns. Concurrently, the inherent shortcomings of traditional training methods, such as inadequate understanding when tackling complex problems requiring deep reasoning, are becoming increasingly apparent.
Represented by research such as OpenAI's o1 model, a new type of "reasoning model" has emerged. A key characteristic of these models is their ability to dynamically adjust the computation time and resources needed for reasoning based on the complexity of the problem. This has led to a new scaling law known as "Test Time Scaling". This capability, dedicating varying depths of "thought" according to problem difficulty, is often compared to the "System 2 thinking" proposed by Daniel Kahneman in "Thinking, Fast and Slow," distinguishing it from the fast, intuitive, immediate responses ("System 1 thinking") of traditional large models. By leveraging this deep, step-by-step thinking ability, reasoning models hold the potential to solve more complex problems that were previously challenging for existing models.
In the open-source community, DeepSeek-R1 stands out as the first representative model employing such reasoning training techniques. Trained by combining rule-based reinforcement learning and the GRPO algorithm, this model achieved significant results and garnered widespread industry attention upon its release.
Since then, integrating reasoning or thought processes into training has become a major trend for mainstream open-source models. For instance, models like Llama 4, Qwen 3, and DeepSeek-Prover-V2 have all incorporated related reasoning-enhanced techniques into their training strategies. Furthermore, with the continuous iteration of similar models (such as DeepSeek-R2), it is foreseeable that reasoning models will become an important paradigm for large models to further elevate their capability ceilings.
Currently, the potential of reasoning models is far from fully realized. Related research remains at the academic forefront (see the Awesome-Inference-Time-Scaling list), with relevant papers continuously emerging, suggesting its potential to evolve into a new, significant model training paradigm.
From an algorithmic standpoint, current training for reasoning models primarily centers on Reinforcement Learning (RL) techniques. These methods are largely consistent with the algorithms used for human preference alignment during the post-training phase of traditional large models.
Mainstream algorithms include:
PPO
GRPO
Rule-based Reinforcement Learning
Simultaneously, the academic community is actively exploring and proposing new RL algorithms, such as RLOO and REINFORCE++. It is foreseeable that, in the short term, RL algorithms for training reasoning models will keep developing and iterating rapidly rather than converging, which requires training frameworks to remain flexible and open.
New training paradigms require the support of corresponding training frameworks. Unlike the dominance of frameworks like Megatron-LM in the traditional LLM pre-training domain, the current landscape for large-scale distributed reinforcement learning training frameworks is diverse. Here are several currently popular or noteworthy RL training frameworks:
verl — Features: Built on Ray; supports integration with mainstream training/inference systems such as FSDP, Megatron-LM, and vLLM. Designed for easy extension with new RL algorithms and offers good performance.
Features: Can utilize Accelerate to integrate DeepSpeed for acceleration. Comparatively, less deeply integrated with dedicated inference frameworks, more focused on research and experimental scenarios.
Features: Implemented based on Megatron-LM and TensorRT-LLM. Community activity is relatively low at present.
License: Apache-2.0
Popularity: GitHub ~0.7k stars
Framework Trend Analysis:
From a technical standpoint, frameworks like verl, which leverage Ray for distributed scheduling while integrating mainstream distributed training libraries (like Megatron-LM) and efficient inference engines (like vLLM), represent a highly promising approach for large-scale model reinforcement learning. This is because they can effectively reuse mature components and adaptation experiences from the existing large model ecosystem.
Almost all open-source models use RoPE (Rotary Position Embedding) based on the same theory from RoFormer: Enhanced Transformer with Rotary Position Embedding. However, there are two ways to implement RoPE: the GPT-J style and the GPT-NeoX style.
The GPT-J style is identical to the original RoFormer, using an interleaved method to calculate RoPE. The GPT-NeoX style uses an alternative, non-interleaved method. According to the Eleuther AI blog, they considered the original implementation inefficient and thus improved it by splitting the dimension into two halves (non-interleaved). Note that the GPT-NeoX and GPT-J styles produce different results.
The GPT-NeoX style RoPE calculation is as follows:
import torch


class RotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, max_seq_len):
        seq = torch.arange(max_seq_len, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
        freqs = torch.outer(seq, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = emb.cos()[:, None, None, :]
        sin = emb.sin()[:, None, None, :]
        return cos, sin


def _rotate_half(x):
    x1, x2 = torch.chunk(x, 2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(t, cos, sin):
    return t * cos + _rotate_half(t) * sin
The GPT-J style has two ways to implement RoPE, with the complex number method being more intuitive.
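For comparison, here is a minimal sketch of the complex-number formulation (written in the spirit of Llama's original implementation, not verbatim; the names are illustrative):

import torch


def precompute_freqs_cis(dim, max_seq_len, base=10000.0):
    # One complex rotation factor e^{i * m * theta_j} per (position m, pair j).
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_seq_len).float()
    freqs = torch.outer(t, inv_freq)                   # (seq, dim/2)
    return torch.polar(torch.ones_like(freqs), freqs)  # complex, (seq, dim/2)


def apply_rotary_emb(x, freqs_cis):
    # x: (batch, seq, heads, dim). Adjacent pairs (x0, x1), (x2, x3), ... are treated
    # as complex numbers, which is exactly the interleaved GPT-J layout.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rotated = x_complex * freqs_cis[:, None, :]      # broadcast over batch and heads
    return torch.view_as_real(x_rotated).flatten(-2).type_as(x)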
Due to the significant influence of the Hugging Face community, many people believe Llama uses the GPT-NeoX style based on its inference code in the transformers library. However, this is not the case. In Llama's original code, it implements the GPT-J style RoPE using the complex number method. So, why the difference between the two codebases? The answer lies in this issue. In the weight conversion script, they permuted the weights of q_proj and k_proj.
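What that permutation does is reorder each head's projection rows from the interleaved (GPT-J) layout to the half-split (GPT-NeoX) layout. A sketch of such a permutation (my own illustrative code, not the verbatim conversion script):

import torch


def permute_for_neox(w, n_heads):
    # w: (n_heads * head_dim, in_dim) projection weight (q_proj or k_proj).
    # Within each head, move the even-indexed rows to the first half and the
    # odd-indexed rows to the second half, i.e. [0, 2, 4, ..., 1, 3, 5, ...].
    out_dim, in_dim = w.shape
    head_dim = out_dim // n_heads
    return (
        w.view(n_heads, head_dim // 2, 2, in_dim)
        .transpose(1, 2)
        .reshape(out_dim, in_dim)
    )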
It's not immediately obvious why this works. We will come back to explain everything later.
A similar situation occurred with DeepSeek-V3. In the original code for deepseek-v3, it uses the same complex number method as Llama to compute GPT-J style RoPE (in fact, their code is very similar). Again, in the Hugging Face code for deepseek-v3, it uses a style similar to GPT-NeoX, just like Llama (again, the code is very similar), but with an exception on lines 364 and 367. Yes, it's very similar to the permute function mentioned above.
q = q.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
k = k.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
So, what is the difference between these two implementations? Many people have the same question, as seen in this discussion.
Since RoPE only acts on the last dimension, a simple example helps understand why.
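For concreteness (this worked example is my own), take a 6-dimensional vector $d = [d_0, d_1, d_2, d_3, d_4, d_5]$ and write $\theta_0, \theta_1, \theta_2$ for the per-pair rotation angles at the current position. The GPT-J (interleaved) style rotates the adjacent pairs $(d_0, d_1), (d_2, d_3), (d_4, d_5)$:

$$r = \mathrm{RoPE}_{\text{GPT-J}}(d) = \left[\, d_0\cos\theta_0 - d_1\sin\theta_0,\ d_0\sin\theta_0 + d_1\cos\theta_0,\ d_2\cos\theta_1 - d_3\sin\theta_1,\ d_2\sin\theta_1 + d_3\cos\theta_1,\ d_4\cos\theta_2 - d_5\sin\theta_2,\ d_4\sin\theta_2 + d_5\cos\theta_2 \,\right] = [r_0, r_1, r_2, r_3, r_4, r_5]$$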
Now, let's look at the DeepSeek style (as implemented in Hugging Face, applying the permutation from the transpose operation before applying the NeoX-style rotation):
First, permute the input vector $d$:

$$d_{\text{permuted}} = \left[\, d_0,\ d_2,\ d_4,\ d_1,\ d_3,\ d_5 \,\right]$$
Then apply the NeoX-style rotation logic to $d_{\text{permuted}}$, pairing each element of the first half with the corresponding element of the second half (with $\theta_0, \theta_1, \theta_2$ again denoting the per-pair rotation angles):

$$\mathrm{RoPE}_{\text{NeoX}}(d_{\text{permuted}}) = \left[\, d_0\cos\theta_0 - d_1\sin\theta_0,\ d_2\cos\theta_1 - d_3\sin\theta_1,\ d_4\cos\theta_2 - d_5\sin\theta_2,\ d_0\sin\theta_0 + d_1\cos\theta_0,\ d_2\sin\theta_1 + d_3\cos\theta_1,\ d_4\sin\theta_2 + d_5\cos\theta_2 \,\right]$$

We find that the DeepSeek style (permute, then apply NeoX RoPE) simply produces a permuted version of the GPT-J style result. Specifically, the resulting vector is $[r_0, r_2, r_4, r_1, r_3, r_5]$, where $r$ is the result vector of the GPT-J style calculation.
Recall how attention is calculated, using the dot product of q and k:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
When $q$ and $k$ are permuted in the same way, their dot product remains unchanged. Let $P$ be the permutation matrix; then $(Pq)^{T}(Pk) = q^{T}P^{T}Pk$. Since $P$ is a permutation matrix, $P^{T}P = I$ (the identity matrix), so $q^{T}P^{T}Pk = q^{T}Ik = q^{T}k$: the dot product result is the same.
So the DeepSeek style (in Hugging Face) is actually equivalent to the GPT-J style in terms of the final attention scores.
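A quick numerical sanity check of this equivalence (my own sketch, for a single position and a single head):

import torch

torch.manual_seed(0)
d = 8
theta = 10000.0 ** (-torch.arange(0, d, 2).float() / d)   # per-pair rotation angles (position m = 1)
cos_j = torch.repeat_interleave(theta.cos(), 2)           # GPT-J: interleaved layout
sin_j = torch.repeat_interleave(theta.sin(), 2)
cos_nx = torch.cat((theta.cos(), theta.cos()))            # NeoX: half-split layout
sin_nx = torch.cat((theta.sin(), theta.sin()))


def rotate_interleaved(x):
    # GPT-J pairing: (x0, x1), (x2, x3), ...
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)


def rotate_half(x):
    # NeoX pairing: (x0, x_{d/2}), (x1, x_{d/2+1}), ...
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


perm = torch.cat((torch.arange(0, d, 2), torch.arange(1, d, 2)))  # [0, 2, 4, ..., 1, 3, 5, ...]
q, k = torch.randn(d), torch.randn(d)

q_j = q * cos_j + rotate_interleaved(q) * sin_j            # GPT-J style RoPE
k_j = k * cos_j + rotate_interleaved(k) * sin_j
q_nx = q[perm] * cos_nx + rotate_half(q[perm]) * sin_nx    # permute first, then NeoX style RoPE
k_nx = k[perm] * cos_nx + rotate_half(k[perm]) * sin_nx

print(torch.allclose(q_j @ k_j, q_nx @ k_nx))              # True: identical attention score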
Now we can return to the Llama code in transformers. Permuting the qw and kw weights beforehand has the same effect as permuting the resulting q and k vectors after the matrix multiplication but before applying RoPE.
In the regular approach (no weight permutation), applying the linear layer gives $q = W_q x$, and RoPE is then applied to $q$.
Applying the linear layer with permuted weights instead gives $(P W_q)x = P(W_q x) = Pq$, i.e. a result vector that is already permuted in the same way as the DeepSeek style's explicit permutation of $q$. The Hugging Face Llama code then applies the NeoX-style RoPE to this already-permuted vector, which, as shown above, is equivalent to applying the GPT-J style RoPE to the original, unpermuted $q$.
Now, we understand the complete picture:
Llama used GPT-J style RoPE during training. However, when converting its original weights to the Hugging Face format, it permuted the qw and kw weights and used a GPT-NeoX-like style for inference (for performance reasons).
DeepSeek also used GPT-J style RoPE during training, but it forgot to permute the qw and kw weights during the weight conversion. Therefore, it needed to add the permutation of q and k within the transformer's inference code (thus gaining no performance benefit from using the NeoX RoPE calculation itself).
One last question remains: if the GPT-NeoX style RoPE calculation is more performant, why do most open-source models still use the GPT-J style RoPE during training? The answer might be related to long context window extension. I haven't delved deeply into this issue yet.
Now that we understand the RoPE situation for Llama and DeepSeek, unfortunately, the story is far from over. Many AI frameworks copied the Hugging Face code for DeepSeek, leading to unnecessary complexity in their training code.
RoPE (Rotary Position Embedding) is a type of relative positional encoding. Currently, most mainstream large models use RoPE or one of its variants. The original paper is RoFormer: Enhanced Transformer with Rotary Position Embedding.
Since self-attention computation is position-independent, positional encoding has been added since the invention of the transformer to capture dependencies between different positions. The transformer uses absolute positional encoding.
Because absolute positional encoding is added directly to the token embedding, it cannot directly model the relative positions between tokens, and inference quality drops sharply on sequences longer than those seen during training. Relative positional encoding was introduced to correct this problem, and RoPE has become the mainstream approach.
The core idea of RoPE is to find a positional encoding function such that the following equation holds:

$$\langle f(q_m, m),\ f(k_n, n) \rangle = g(q_m, k_n, m - n)$$

That is, when calculating the dot product of $q$ and $k$ during attention, the result does not depend on the absolute positions $m$ and $n$ of the tokens, only on the relative position $m - n$.
When the embedding dimension $d$ is only 2, the following formulas precisely satisfy the above property (viewing the two-dimensional vectors $q_m$ and $k_n$ as complex numbers, with $\overline{k_n}$ the complex conjugate of $k_n$):

$$f(q_m, m) = q_m e^{im\theta}$$
$$f(k_n, n) = k_n e^{in\theta}$$
$$g(q_m, k_n, m - n) = \mathrm{Re}\!\left[\, q_m \overline{k_n}\, e^{i(m-n)\theta} \,\right]$$

Substituting the first two formulas into the inner product and using $\langle a, b \rangle = \mathrm{Re}[a\overline{b}]$, the RoPE relation above is easily verified.
According to Euler's formula:
$$e^{i\phi} = \cos\phi + i\sin\phi$$
Expanding this, we can obtain f, which is consistent for both q and k:
$$f(q_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} q_m^{(1)} \\ q_m^{(2)} \end{pmatrix}$$
Actually, we can understand the above Rope formula from another more intuitive perspective. The geometric meaning of the dot product of two-dimensional vectors is the product of their lengths multiplied by the cosine of the angle between them. The above Rope positional encoding function is equivalent to rotating the vector while keeping its length unchanged. Therefore, calculating the dot product of two rotated vectors only involves the relative rotation angle and is independent of the absolute angles.
Once the $d = 2$ case is understood, it becomes relatively easy to understand the case where $d$ is any even number. The embedding dimensions are divided into pairs, and a different $\theta_i = 10000^{-2(i-1)/d},\ i \in \{1, 2, \ldots, d/2\}$ is applied to each pair, resulting in the complete RoPE positional encoding function.
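Written out explicitly, the full encoding function is the block-diagonal rotation from the RoFormer paper, one 2×2 rotation per pair:

$$f(q_m, m) = \begin{pmatrix}
\cos m\theta_1 & -\sin m\theta_1 & & & \\
\sin m\theta_1 & \cos m\theta_1 & & & \\
& & \ddots & & \\
& & & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
& & & \sin m\theta_{d/2} & \cos m\theta_{d/2}
\end{pmatrix} q_m$$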
The core concept of RoPE is to rotate the embedding vectors based on the token's position. This is achieved by applying a rotation matrix to the token's embedding, where the rotation angle is determined by the token's position in the sequence. By rotating the embeddings instead of using fixed position encodings, the model can maintain more flexible and continuous position information.
Although RoPE applies relative positional encoding, it is still limited in generalizing beyond the context window seen during training. Several extension methods have been proposed; YaRN is the most popular one, offering a good balance between performance and complexity. The original paper can be found in YaRN: Efficient Context Window Extension of Large Language Models.
The key point is that when $\theta_i$ is small, the rotation is slow; the number of full rotations over the training context length $L$ is $r_i = \frac{L\theta_i}{2\pi}$. If $r_i < 1$, that dimension never completes a full cycle during training, so its $\theta_i$ should be interpolated. If $r_i > \tau$ (a threshold), it can safely be extrapolated (left unchanged). A linear ramp is applied between the two conditions.
YaRN also adds a scaling weight to the attention softmax (the logits are multiplied by $\lambda$):

$$\lambda = \left(1 + 0.1 \ln\frac{L'}{L}\right)^{2}$$

where $L$ is the original context length and $L'$ is the extended one. This is an empirical value without theoretical justification.
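Putting the two ideas together, a rough sketch of the frequency adjustment and the scaling weight might look like this (the threshold tau and the exact blending are assumptions based on the description above, not the paper's precise recipe):

import math
import torch


def yarn_inv_freq(dim, L, L_new, base=10000.0, tau=32.0):
    # YaRN-style frequency adjustment: interpolate dims with r_i < 1,
    # keep dims with r_i > tau, and blend linearly in between.
    theta = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))   # theta_i
    s = L_new / L                                                     # extension factor
    r = L * theta / (2 * math.pi)            # full rotations seen during training
    gamma = ((r - 1.0) / (tau - 1.0)).clamp(0.0, 1.0)  # 0: interpolate, 1: keep
    return gamma * theta + (1.0 - gamma) * theta / s


def yarn_attn_scale(L, L_new):
    # Empirical scaling weight lambda from the formula above.
    return (1.0 + 0.1 * math.log(L_new / L)) ** 2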
Occasionally, users encounter difficulties when attempting to send EPUB files to their Kindle devices using Amazon's "Send to Kindle" service. The underlying causes for these conversion failures remain elusive, presenting a challenge for consistent troubleshooting.
However, a practical workaround often proves effective:
Intermediate AZW3 Conversion:
Convert the problematic EPUB file to the AZW3 format, which is a Kindle-native format.
Subsequently, convert the AZW3 file back to EPUB.
This two-step conversion process can effectively address certain compatibility issues that may be hindering the initial "Send to Kindle" conversion. By introducing this intermediate format, the file undergoes a re-processing that often resolves underlying formatting or encoding conflicts.
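For example, if Calibre is installed, the two conversions can be scripted with its ebook-convert command-line tool (a sketch; the file names are placeholders):

import subprocess


def epub_roundtrip(epub_path, workdir="."):
    # Convert EPUB -> AZW3 -> EPUB, which often normalizes formatting issues along the way.
    azw3_path = f"{workdir}/intermediate.azw3"
    fixed_epub = f"{workdir}/fixed.epub"
    subprocess.run(["ebook-convert", epub_path, azw3_path], check=True)
    subprocess.run(["ebook-convert", azw3_path, fixed_epub], check=True)
    return fixed_epub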
While this workaround is frequently successful, it's important to note that the specific reasons for the original conversion failures can vary. Amazon's conversion algorithms and the inherent complexities of EPUB formatting contribute to this variability.
The IEEE 754 standard defines how floating-point numbers are represented in computers. A number, V, is expressed as:
$$V = (-1)^{s} \times M \times 2^{E}$$
Where:
s (Sign): Determines the sign of the number: s=0 for positive, s=1 for negative.
M (Significand/Mantissa): A fractional binary number. It ranges from $1$ to $2 - \epsilon$ for normalized values, or from $0$ to $1 - \epsilon$ for denormalized values, where $\epsilon$ is the machine epsilon.
E (Exponent): Weights the value by a (possibly negative) power of 2.
Floating-point numbers can represent various special values, defined by the exponent (e) and fraction (f) fields:

Category              Condition              Value
Normalized Values     0 < e < 2^k − 1        (−1)^s × (1 + f) × 2^(e − bias)
Denormalized Values   e = 0                  (−1)^s × f × 2^(1 − bias)
Infinity              e = 2^k − 1, f = 0     (−1)^s × ∞
NaN (Not a Number)    e = 2^k − 1, f ≠ 0     NaN
Where the bias is $2^{k-1} - 1$.
Denormalized numbers serve two crucial purposes:
Representation of Zero: They allow for distinct representations of positive (+0.0) and negative (−0.0) zero, differentiated by the sign bit.
Representation of Values Close to Zero: They enable the representation of numbers very close to 0.0, filling the gap between zero and the smallest normalized number.
To align 3.14 with $1 \times 10^{10}$, its significand must be right-shifted by about 32 bits. Because the fraction field of a single-precision float has only 23 bits, 3.14 is shifted out entirely and effectively becomes 0.0, so an expression like (3.14 + 1e10) − 1e10 evaluates to 0.0 in single precision.
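These cases are easy to check in Python with the struct module (float_bits is a small helper written for this illustration):

import struct


def float_bits(x):
    # Bit pattern of a single-precision float: 1 sign bit, 8 exponent bits, 23 fraction bits.
    b = struct.unpack(">I", struct.pack(">f", x))[0]
    return f"{b >> 31:01b} {(b >> 23) & 0xFF:08b} {b & 0x7FFFFF:023b}"


print(float_bits(0.0))             # +0.0: all bits zero
print(float_bits(-0.0))            # -0.0: only the sign bit set
print(float_bits(1e-45))           # smallest positive denormal (e = 0, f != 0)
print(float_bits(float("inf")))    # infinity: e all ones, f = 0
print(float_bits(float("nan")))    # NaN: e all ones, f != 0

# 3.14 is lost entirely when aligned against 1e10 in single precision:
x = struct.unpack(">f", struct.pack(">f", 3.14 + 1e10))[0]
print(x - 1e10)                    # 0.0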
As Apple and Homebrew ceased support for my 2015 MacBook Pro, I turned to OpenCore Patcher to keep it functional. However, many Python packages on PyPI recently stopped supporting Intel macOS, which led me to free up approximately 400 GB of disk space and install Linux Mint alongside macOS.
After the installation, the OpenCore Patcher boot menu unexpectedly disappeared and the machine rebooted straight into Linux Mint. This happens because Linux Mint placed its own bootloader first in the boot order. Each operating system, whether macOS or Linux, installs its own bootloader in the EFI partition; in my case, the EFI partition was mounted at /boot/efi under Linux Mint.
The original configuration file for OpenCore is located at /boot/efi/EFI/OC/config.plist.
Backup this file before making any changes.
Navigate to Misc -> BlessOverride within the config.plist editor.
Change the value from \EFI\Microsoft\Boot\bootmgfw.efi to \EFI\ubuntu\grubx64.efi.
Restart and Access OpenCore Menu:
After making these changes, restart your computer while holding down the Option key.
From the boot menu, select "OpenCore" to access the EFI menu.
Set OpenCore as Default Bootloader:
Once you boot into the OpenCore menu, OpenCore will automatically set itself as the first entry in the boot order.
This means you won’t need to press Option every time you reboot your system.
By following these steps, I have successfully restored the functionality of my dual-boot setup, ensuring a seamless transition between macOS and Linux Mint while retaining control over which operating system boots by default.