pymllm.layers.rope¶
Functions¶
- apply_rope: Apply rotary embedding to a batch of queries/keys (stored as a RaggedTensor).
- apply_llama31_rope: Apply Llama 3.1 style rotary embedding to a batch of queries/keys.
- apply_rope_pos_ids: Apply rotary embedding using explicit per-token position IDs.
- apply_llama31_rope_pos_ids: Apply Llama 3.1 style RoPE using explicit per-token position IDs.
- apply_rope_with_cos_sin_cache: Apply rotary embedding with a precomputed cos/sin cache.
- apply_mrope: Apply multi-dimensional rotary position embedding (M-RoPE).
Module Contents¶
- pymllm.layers.rope.apply_rope(q, k, indptr, offsets, inplace=False, rotary_dim=None, interleave=False, rope_scale=1.0, rope_theta=10000.0)¶
Apply rotary embedding to a batch of queries/keys (stored as RaggedTensor).
cos/sin values are computed on the fly inside the kernel. Position offsets are provided per segment via indptr and offsets.
- Parameters:
q (torch.Tensor) – Query ragged tensor, shape (nnz, num_q_heads, head_dim).
k (torch.Tensor) – Key ragged tensor, shape (nnz, num_k_heads, head_dim).
indptr (torch.Tensor) – Indptr tensor, shape (batch_size + 1,). The i-th segment spans q[indptr[i]:indptr[i+1]].
offsets (torch.Tensor) – Relative position offsets per segment, shape (batch_size,).
inplace (bool) – If True, apply RoPE in place and return None. If False, return new (q_rope, k_rope) tensors.
rotary_dim (Optional[int]) – Number of dimensions to apply RoPE to. None means the entire head_dim.
interleave (bool) – If True, rotate even/odd dims ([..., ::2] / [..., 1::2]). If False, rotate first/second-half dims.
rope_scale (float) – Scaling factor for position indices.
rope_theta (float) – Base frequency theta.
- Returns: None when inplace is True, otherwise a tuple (q_rope, k_rope) of rotated tensors with the same shapes as the inputs.
- Return type: Optional[Tuple[torch.Tensor, torch.Tensor]]
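The per-token rotation can be sketched in pure PyTorch. This is an illustrative reference, not part of pymllm: rope_reference is a hypothetical helper, the non-interleaved (first/second-half) layout is assumed, and the kernel's exact frequency convention may differ.

```python
import torch

def rope_reference(x, pos, rotary_dim=None, rope_scale=1.0, rope_theta=10000.0):
    # Non-interleaved RoPE: rotate the first half of the rotary dims against
    # the second half, with one frequency per rotated pair.
    head_dim = x.shape[-1]
    rotary_dim = rotary_dim or head_dim
    half = rotary_dim // 2
    # inv_freq[j] = theta^(-2j / rotary_dim)
    inv_freq = rope_theta ** (-2 * torch.arange(half, dtype=torch.float64) / rotary_dim)
    angle = (pos / rope_scale) * inv_freq            # (half,)
    cos, sin = angle.cos(), angle.sin()
    out = x.clone()
    x1, x2 = x[..., :half], x[..., half:rotary_dim]
    out[..., :half] = x1 * cos - x2 * sin
    out[..., half:rotary_dim] = x2 * cos + x1 * sin  # dims past rotary_dim untouched
    return out

q = torch.randn(5, 4, 64, dtype=torch.float64)       # (nnz, num_heads, head_dim)
assert torch.allclose(rope_reference(q, torch.tensor(0.0)), q)  # pos 0: identity
```

At position 0 the rotation is the identity, and because each pair of dims undergoes an orthogonal 2-D rotation, vector norms are preserved.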
- pymllm.layers.rope.apply_llama31_rope(q, k, indptr, offsets, inplace=False, rotary_dim=None, interleave=False, rope_scale=8.0, rope_theta=500000.0, low_freq_factor=1.0, high_freq_factor=4.0, old_context_len=8192)¶
Apply Llama 3.1 style rotary embedding to a batch of queries/keys.
This variant adjusts frequencies with low_freq_factor, high_freq_factor, and old_context_len following the Llama 3.1 RoPE recipe. cos/sin values are computed on the fly.
- Parameters:
q (torch.Tensor) – Query ragged tensor, shape (nnz, num_q_heads, head_dim).
k (torch.Tensor) – Key ragged tensor, shape (nnz, num_k_heads, head_dim).
indptr (torch.Tensor) – Indptr tensor, shape (batch_size + 1,).
offsets (torch.Tensor) – Relative position offsets per segment, shape (batch_size,).
inplace (bool) – If True, apply in place and return None.
rotary_dim (Optional[int]) – Number of dimensions to apply RoPE to. None means the entire head_dim.
interleave (bool) – If True, rotate even/odd dims; otherwise first/second-half dims.
rope_scale (float) – Scaling factor for position indices (default 8).
rope_theta (float) – Base frequency theta (default 5e5).
low_freq_factor (float) – Low-frequency factor for Llama 3.1 RoPE.
high_freq_factor (float) – High-frequency factor for Llama 3.1 RoPE.
old_context_len (int) – Original context length for Llama 3.1 RoPE.
- Returns: None when inplace is True, otherwise (q_rope, k_rope).
- Return type: Optional[Tuple[torch.Tensor, torch.Tensor]]
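The frequency adjustment can be sketched as follows. llama31_adjust_inv_freq is a hypothetical helper illustrating the widely used Llama 3.1 recipe, which this function is assumed to follow: long-wavelength frequencies are divided by rope_scale, short-wavelength ones are kept, and a band in between is blended smoothly.

```python
import math
import torch

def llama31_adjust_inv_freq(inv_freq, rope_scale=8.0, low_freq_factor=1.0,
                            high_freq_factor=4.0, old_context_len=8192):
    # Wavelength (in tokens) of each rotary frequency.
    wavelen = 2 * math.pi / inv_freq
    # 0 below the band (wavelen > old_context_len / low_freq_factor),
    # 1 above it (wavelen < old_context_len / high_freq_factor), linear between.
    smooth = (old_context_len / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor)
    smooth = smooth.clamp(0.0, 1.0)
    # Blend: scaled-down low frequencies, untouched high frequencies.
    return (1 - smooth) * inv_freq / rope_scale + smooth * inv_freq

inv_freq = 500000.0 ** (-torch.arange(64, dtype=torch.float64) / 64)
adj = llama31_adjust_inv_freq(inv_freq)
assert torch.isclose(adj[0], inv_freq[0])            # highest frequency kept
assert torch.isclose(adj[-1], inv_freq[-1] / 8.0)    # lowest divided by rope_scale
```

Slowing only the low frequencies is what lets positions beyond old_context_len be represented without disturbing the short-range rotations.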
- pymllm.layers.rope.apply_rope_pos_ids(q, k, pos_ids, inplace=False, rotary_dim=None, interleave=False, rope_scale=1.0, rope_theta=10000.0)¶
Apply rotary embedding using explicit per-token position IDs.
Unlike apply_rope(), which derives positions from indptr/offsets, this function takes a flat pos_ids tensor that supplies an explicit position for every token.
- Parameters:
q (torch.Tensor) – Query tensor, shape (nnz, num_q_heads, head_dim).
k (torch.Tensor) – Key tensor, shape (nnz, num_k_heads, head_dim).
pos_ids (torch.Tensor) – Position indices, shape (nnz,).
inplace (bool) – If True, apply in place and return None.
rotary_dim (Optional[int]) – Number of dimensions to apply RoPE to.
interleave (bool) – Interleaved layout flag.
rope_scale (float) – Scaling factor for position indices.
rope_theta (float) – Base frequency theta.
- Returns: None when inplace is True, otherwise (q_rope, k_rope).
- Return type: Optional[Tuple[torch.Tensor, torch.Tensor]]
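The relationship to apply_rope() can be made concrete: the flat pos_ids equivalent of a given indptr/offsets pair assigns token j of segment i the position offsets[i] + j. A small sketch (pos_ids_from_indptr is an illustrative helper, not part of pymllm):

```python
import torch

def pos_ids_from_indptr(indptr, offsets):
    # Token j of segment i gets position offsets[i] + j, matching how
    # apply_rope() derives positions from indptr/offsets.
    seg_lens = indptr[1:] - indptr[:-1]
    starts = torch.repeat_interleave(indptr[:-1], seg_lens)
    intra = torch.arange(int(indptr[-1])) - starts     # index within each segment
    return torch.repeat_interleave(offsets, seg_lens) + intra

indptr = torch.tensor([0, 3, 5])      # two segments, lengths 3 and 2
offsets = torch.tensor([10, 0])
print(pos_ids_from_indptr(indptr, offsets).tolist())   # [10, 11, 12, 0, 1]
```

This is useful when mixing the two call styles, e.g. to feed the same batch through apply_rope_pos_ids() for verification.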
- pymllm.layers.rope.apply_llama31_rope_pos_ids(q, k, pos_ids, inplace=False, rotary_dim=None, interleave=False, rope_scale=8.0, rope_theta=500000.0, low_freq_factor=1.0, high_freq_factor=4.0, old_context_len=8192)¶
Apply Llama 3.1 style RoPE using explicit per-token position IDs.
Combines the Llama 3.1 frequency adjustments with explicit pos_ids.
- Parameters:
q (torch.Tensor) – Query tensor, shape (nnz, num_q_heads, head_dim).
k (torch.Tensor) – Key tensor, shape (nnz, num_k_heads, head_dim).
pos_ids (torch.Tensor) – Position indices, shape (nnz,).
inplace (bool) – If True, apply in place and return None.
rotary_dim (Optional[int]) – Number of dimensions to apply RoPE to.
interleave (bool) – Interleaved layout flag.
rope_scale (float) – Scaling factor for position indices (default 8).
rope_theta (float) – Base frequency theta (default 5e5).
low_freq_factor (float) – Low-frequency factor for Llama 3.1 RoPE.
high_freq_factor (float) – High-frequency factor for Llama 3.1 RoPE.
old_context_len (int) – Original context length for Llama 3.1 RoPE.
- Returns: None when inplace is True, otherwise (q_rope, k_rope).
- Return type: Optional[Tuple[torch.Tensor, torch.Tensor]]
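A brief sketch of the rotary_dim semantics shared by the apply_*rope functions: with rotary_dim < head_dim, only the leading rotary_dim entries of each head are rotated and the tail passes through unchanged. Plain RoPE frequencies are used here for brevity (the Llama 3.1 adjustment only changes the angles, not which dims rotate), and the non-interleaved layout is assumed:

```python
import torch

# rotary_dim < head_dim: only the leading rotary_dim entries of each head are
# rotated; the trailing head_dim - rotary_dim entries pass through unchanged.
head_dim, rotary_dim = 8, 4
half = rotary_dim // 2
inv_freq = 500000.0 ** (-2 * torch.arange(half, dtype=torch.float64) / rotary_dim)
angle = 7.0 * inv_freq                      # explicit pos_id = 7 for this token
cos, sin = angle.cos(), angle.sin()

q = torch.randn(1, 2, head_dim, dtype=torch.float64)
out = q.clone()
x1, x2 = q[..., :half], q[..., half:rotary_dim]
out[..., :half] = x1 * cos - x2 * sin
out[..., half:rotary_dim] = x2 * cos + x1 * sin

print(torch.equal(out[..., rotary_dim:], q[..., rotary_dim:]))  # True
```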
- pymllm.layers.rope.apply_rope_with_cos_sin_cache(positions, query, key, head_size, cos_sin_cache, inplace=False, is_neox=True)¶
Apply rotary embedding with precomputed cos/sin cache.
Compatible with SGL/vLLM implementations. Note that query and key use a flattened head layout (nnz, num_heads * head_size) instead of the 3-D layout used by the other apply_rope* functions.
- Parameters:
positions (torch.Tensor) – Position indices, shape (nnz,).
query (torch.Tensor) – Query tensor, shape (nnz, num_q_heads * head_size).
key (torch.Tensor) – Key tensor, shape (nnz, num_k_heads * head_size).
head_size (int) – Size of each attention head.
cos_sin_cache (torch.Tensor) – Precomputed cos/sin tensor, shape (max_seq_len, rotary_dim). The first half of rotary_dim stores cosine values, the second half stores sine values.
inplace (bool) – If True, apply in place and return None.
is_neox (bool) – If True (default), use GPT-NeoX style (rotate first/second-half dims). If False, use interleaved style (rotate even/odd dims).
- Returns: None when inplace is True, otherwise (query_out, key_out) with the same shapes as the inputs.
- Return type: Optional[Tuple[torch.Tensor, torch.Tensor]]
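The cache layout and a NeoX-style application can be sketched as follows. build_cos_sin_cache and apply_neox are illustrative helpers under the stated layout assumptions, not the library's kernels, and the demo takes rotary_dim equal to head_size:

```python
import torch

def build_cos_sin_cache(max_seq_len, rotary_dim, theta=10000.0):
    # Row p holds cos(p * f_j) in the first rotary_dim // 2 columns and
    # sin(p * f_j) in the rest, matching the layout documented above.
    half = rotary_dim // 2
    inv_freq = theta ** (-2 * torch.arange(half, dtype=torch.float64) / rotary_dim)
    angles = torch.arange(max_seq_len, dtype=torch.float64)[:, None] * inv_freq
    return torch.cat([angles.cos(), angles.sin()], dim=-1)  # (max_seq_len, rotary_dim)

def apply_neox(query, positions, head_size, cache):
    # NeoX-style rotation on the flattened (nnz, num_heads * head_size) layout.
    nnz = query.shape[0]
    half = cache.shape[1] // 2
    cos = cache[positions, :half][:, None, :]    # (nnz, 1, half)
    sin = cache[positions, half:][:, None, :]
    q = query.view(nnz, -1, head_size)
    out = q.clone()
    x1, x2 = q[..., :half], q[..., half:2 * half]
    out[..., :half] = x1 * cos - x2 * sin
    out[..., half:2 * half] = x2 * cos + x1 * sin
    return out.view(nnz, -1)

cache = build_cos_sin_cache(16, rotary_dim=8)
q = torch.randn(3, 2 * 8, dtype=torch.float64)   # nnz = 3, two heads of size 8
out = apply_neox(q, torch.tensor([0, 1, 2]), 8, cache)
assert torch.allclose(out[0], q[0])              # position 0 is the identity
```

Precomputing the cache trades a small, reusable table lookup for the on-the-fly cos/sin computation of the other apply_rope* variants.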
- pymllm.layers.rope.apply_mrope(q, k, positions, cos_sin_cache, mrope_section, mrope_interleaved=True)¶
Apply multi-dimensional rotary position embedding (M-RoPE).
Used by Qwen3-VL, which assigns independent (t, h, w) position indices to each token. For text tokens all three indices share the same sequential value; for image tokens they follow the spatial grid layout.
- Parameters:
q (torch.Tensor) – Query tensor, shape (T, num_q_heads, head_dim).
k (torch.Tensor) – Key tensor, shape (T, num_kv_heads, head_dim).
positions (torch.Tensor) – 3-D position IDs, shape (3, T); rows are the (temporal, height, width) position indices.
cos_sin_cache (torch.Tensor) – Precomputed cache, shape (max_pos, head_dim). The first head_dim // 2 columns are cosine values and the remaining columns are sine values, each for frequencies 0, 1, ..., head_dim // 2 - 1.
mrope_section (List[int]) – Three integers [s_t, s_h, s_w] that partition the head_dim // 2 rotary frequency dimensions among the temporal, height, and width components. sum(mrope_section) must equal head_dim // 2.
mrope_interleaved (bool) – When True (Qwen3-VL default), frequency dimensions cycle through (t, h, w, t, h, w, ...) rather than being grouped consecutively.
- Returns: (q_rope, k_rope) with the same shapes as the inputs.
- Return type: Tuple[torch.Tensor, torch.Tensor]
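The cos/sin selection can be sketched for the grouped layout (i.e. mrope_interleaved=False; the interleaved default would cycle frequency dims (t, h, w, ...) instead). mrope_cos_sin is an illustrative helper, not part of pymllm:

```python
import torch

def mrope_cos_sin(positions, cos_sin_cache, mrope_section):
    # Grouped (non-interleaved) M-RoPE selection: the first s_t frequency dims
    # follow the temporal row of positions, the next s_h the height row, and
    # the final s_w the width row.
    half = cos_sin_cache.shape[1] // 2
    assert sum(mrope_section) == half
    cos_parts, sin_parts, lo = [], [], 0
    for row, sec in enumerate(mrope_section):
        rows = cos_sin_cache[positions[row]]            # (T, head_dim)
        cos_parts.append(rows[:, lo:lo + sec])
        sin_parts.append(rows[:, half + lo:half + lo + sec])
        lo += sec
    return torch.cat(cos_parts, -1), torch.cat(sin_parts, -1)  # each (T, half)

# Cache with cosines in the first half columns and sines in the second half.
half, max_pos = 8, 32
inv_freq = 10000.0 ** (-torch.arange(half, dtype=torch.float64) / half)
angles = torch.arange(max_pos, dtype=torch.float64)[:, None] * inv_freq
cache = torch.cat([angles.cos(), angles.sin()], -1)     # (max_pos, 2 * half)

positions = torch.tensor([[5, 5], [5, 9], [5, 11]])     # (3, T); token 0 is text-like
cos, sin = mrope_cos_sin(positions, cache, [2, 3, 3])
# A text token (t == h == w) reduces to the ordinary RoPE row for its position.
assert torch.allclose(cos[0], cache[5, :half])
```

For image tokens the three rows differ, so different frequency bands rotate by different temporal, height, and width angles.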