pymllm.layers.rope

Functions

apply_rope(q, k, indptr, offsets[, inplace, ...])

Apply rotary embedding to a batch of queries/keys (stored as RaggedTensor).

apply_llama31_rope(q, k, indptr, offsets[, inplace, ...])

Apply Llama 3.1 style rotary embedding to a batch of queries/keys.

apply_rope_pos_ids(q, k, pos_ids[, inplace, ...])

Apply rotary embedding using explicit per-token position IDs.

apply_llama31_rope_pos_ids(q, k, pos_ids[, inplace, ...])

Apply Llama 3.1 style RoPE using explicit per-token position IDs.

apply_rope_with_cos_sin_cache(positions, query, key, ...)

Apply rotary embedding with precomputed cos/sin cache.

apply_mrope(q, k, positions, cos_sin_cache, mrope_section)

Apply multi-dimensional rotary position embedding (M-RoPE).

Module Contents

pymllm.layers.rope.apply_rope(q, k, indptr, offsets, inplace=False, rotary_dim=None, interleave=False, rope_scale=1.0, rope_theta=10000.0)

Apply rotary embedding to a batch of queries/keys (stored as RaggedTensor).

The cos/sin values are computed on the fly inside the kernel; position offsets are provided per segment via indptr and offsets.

Parameters:
  • q (torch.Tensor) – Query ragged tensor, shape (nnz, num_q_heads, head_dim).

  • k (torch.Tensor) – Key ragged tensor, shape (nnz, num_k_heads, head_dim).

  • indptr (torch.Tensor) – Indptr tensor, shape (batch_size + 1,). The i-th segment spans q[indptr[i]:indptr[i+1]].

  • offsets (torch.Tensor) – Relative position offsets per segment, shape (batch_size,).

  • inplace (bool) – If True, apply RoPE in-place and return None. If False, return new (q_rope, k_rope) tensors.

  • rotary_dim (Optional[int]) – Number of dimensions to apply RoPE to. None means the entire head_dim.

  • interleave (bool) – If True, rotate even/odd dims ([..., ::2] / [..., 1::2]). If False, rotate first/second half dims.

  • rope_scale (float) – Scaling factor for position indices.

  • rope_theta (float) – Base frequency theta.

Returns:

None when inplace is True, otherwise a tuple (q_rope, k_rope) of rotated tensors with the same shapes as the inputs.

Return type:

Optional[Tuple[torch.Tensor, torch.Tensor]]
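The per-segment position semantics and the non-interleaved rotation can be sketched in NumPy. This is a reference sketch of the assumed math, not the actual kernel; `ref_rope` is a hypothetical helper name, and dividing positions by `rope_scale` follows the usual linear-scaling convention.

```python
import numpy as np

def ref_rope(x, pos_ids, rope_theta=10000.0, rope_scale=1.0):
    """Reference RoPE, non-interleaved layout: rotate the first and
    second halves of head_dim. x has shape (nnz, num_heads, head_dim)."""
    head_dim = x.shape[-1]
    half = head_dim // 2
    # Inverse frequencies theta^(-i / (head_dim/2)) for i in 0..half-1.
    inv_freq = 1.0 / rope_theta ** (np.arange(half) / half)
    angles = np.einsum("t,d->td", pos_ids / rope_scale, inv_freq)  # (nnz, half)
    cos = np.cos(angles)[:, None, :]  # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Positions implied by (indptr, offsets): token j of segment i gets
# position offsets[i] + (j - indptr[i]).
indptr = np.array([0, 3, 5])
offsets = np.array([10, 0])
pos_ids = np.concatenate(
    [offsets[i] + np.arange(indptr[i + 1] - indptr[i])
     for i in range(len(offsets))])
# pos_ids -> [10, 11, 12, 0, 1]
```

Because each pair of dims is rotated by a plane rotation, the transform preserves per-token vector norms.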

pymllm.layers.rope.apply_llama31_rope(q, k, indptr, offsets, inplace=False, rotary_dim=None, interleave=False, rope_scale=8.0, rope_theta=500000.0, low_freq_factor=1.0, high_freq_factor=4.0, old_context_len=8192)

Apply Llama 3.1 style rotary embedding to a batch of queries/keys.

This variant adjusts frequencies with low_freq_factor, high_freq_factor, and old_context_len following the Llama 3.1 RoPE recipe; cos/sin values are computed on the fly.

Parameters:
  • q (torch.Tensor) – Query ragged tensor, shape (nnz, num_q_heads, head_dim).

  • k (torch.Tensor) – Key ragged tensor, shape (nnz, num_k_heads, head_dim).

  • indptr (torch.Tensor) – Indptr tensor, shape (batch_size + 1,).

  • offsets (torch.Tensor) – Relative position offsets per segment, shape (batch_size,).

  • inplace (bool) – If True, apply in-place and return None.

  • rotary_dim (Optional[int]) – Number of dimensions to apply RoPE to. None means the entire head_dim.

  • interleave (bool) – If True, rotate even/odd dims; otherwise first/second half dims.

  • rope_scale (float) – Scaling factor for position indices (default 8).

  • rope_theta (float) – Base frequency theta (default 5e5).

  • low_freq_factor (float) – Low frequency factor for Llama 3.1 RoPE.

  • high_freq_factor (float) – High frequency factor for Llama 3.1 RoPE.

  • old_context_len (int) – Original context length for Llama 3.1 RoPE.

Returns:

None when inplace is True, otherwise (q_rope, k_rope).

Return type:

Optional[Tuple[torch.Tensor, torch.Tensor]]
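The band-wise frequency adjustment can be sketched as follows. This is a reference sketch of the published Llama 3.1 recipe under the defaults above; the kernel computes it on the fly, and `llama31_adjust_inv_freq` is a hypothetical name.

```python
import numpy as np

def llama31_adjust_inv_freq(inv_freq, rope_scale=8.0,
                            low_freq_factor=1.0, high_freq_factor=4.0,
                            old_context_len=8192):
    """Llama 3.1 RoPE adjustment: keep high frequencies, divide low
    frequencies by rope_scale, and smoothly interpolate in between."""
    wavelen = 2 * np.pi / inv_freq
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    # Interpolation factor for the mid band (masked out elsewhere).
    smooth = (old_context_len / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor)
    return np.where(
        wavelen < high_freq_wavelen, inv_freq,            # high freq: keep
        np.where(wavelen > low_freq_wavelen,
                 inv_freq / rope_scale,                   # low freq: scale down
                 (1 - smooth) * inv_freq / rope_scale + smooth * inv_freq))

inv_freq = 1.0 / 500000.0 ** (np.arange(64) / 64)
adj = llama31_adjust_inv_freq(inv_freq)
```

Every adjusted frequency is at most the original one, so the effect is a band-dependent stretching of wavelengths.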

pymllm.layers.rope.apply_rope_pos_ids(q, k, pos_ids, inplace=False, rotary_dim=None, interleave=False, rope_scale=1.0, rope_theta=10000.0)

Apply rotary embedding using explicit per-token position IDs.

Unlike apply_rope(), which derives positions from indptr / offsets, this function takes a flat pos_ids tensor that supplies an explicit position for every token.

Parameters:
  • q (torch.Tensor) – Query tensor, shape (nnz, num_q_heads, head_dim).

  • k (torch.Tensor) – Key tensor, shape (nnz, num_k_heads, head_dim).

  • pos_ids (torch.Tensor) – Position indices, shape (nnz,).

  • inplace (bool) – If True, apply in-place and return None.

  • rotary_dim (Optional[int]) – Number of dimensions to apply RoPE to.

  • interleave (bool) – Interleaved layout flag.

  • rope_scale (float) – Scaling factor for position indices.

  • rope_theta (float) – Base frequency theta.

Returns:

None when inplace is True, otherwise (q_rope, k_rope).

Return type:

Optional[Tuple[torch.Tensor, torch.Tensor]]
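Explicit positions are useful when packed sequences do not start at position zero, e.g. when resuming decode with a prefix already in the KV cache. A small sketch with illustrative values (the commented call shows where the hypothetical tensors would be passed):

```python
import numpy as np

# Two packed sequences of lengths 3 and 2; the second resumes at
# position 7 (e.g. 7 prompt tokens are already in the KV cache).
seq_lens = [3, 2]
start_pos = [0, 7]
pos_ids = np.concatenate(
    [s + np.arange(n) for s, n in zip(start_pos, seq_lens)])
# pos_ids -> [0, 1, 2, 7, 8]; supplied per token instead of (indptr, offsets):
# q_rope, k_rope = apply_rope_pos_ids(q, k, torch.from_numpy(pos_ids))
```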

pymllm.layers.rope.apply_llama31_rope_pos_ids(q, k, pos_ids, inplace=False, rotary_dim=None, interleave=False, rope_scale=8.0, rope_theta=500000.0, low_freq_factor=1.0, high_freq_factor=4.0, old_context_len=8192)

Apply Llama 3.1 style RoPE using explicit per-token position IDs.

Combines Llama 3.1 frequency adjustments with explicit pos_ids.

Parameters:
  • q (torch.Tensor) – Query tensor, shape (nnz, num_q_heads, head_dim).

  • k (torch.Tensor) – Key tensor, shape (nnz, num_k_heads, head_dim).

  • pos_ids (torch.Tensor) – Position indices, shape (nnz,).

  • inplace (bool) – If True, apply in-place and return None.

  • rotary_dim (Optional[int]) – Number of dimensions to apply RoPE to.

  • interleave (bool) – Interleaved layout flag.

  • rope_scale (float) – Scaling factor (default 8).

  • rope_theta (float) – Base frequency theta (default 5e5).

  • low_freq_factor (float) – Low frequency factor for Llama 3.1 RoPE.

  • high_freq_factor (float) – High frequency factor for Llama 3.1 RoPE.

  • old_context_len (int) – Original context length for Llama 3.1 RoPE.

Returns:

None when inplace is True, otherwise (q_rope, k_rope).

Return type:

Optional[Tuple[torch.Tensor, torch.Tensor]]

pymllm.layers.rope.apply_rope_with_cos_sin_cache(positions, query, key, head_size, cos_sin_cache, inplace=False, is_neox=True)

Apply rotary embedding with precomputed cos/sin cache.

Compatible with SGL/vLLM implementations. Note that query and key use a flattened head layout (nnz, num_heads * head_size) instead of the 3-D layout used by the other apply_rope* functions.

Parameters:
  • positions (torch.Tensor) – Position indices, shape (nnz,).

  • query (torch.Tensor) – Query tensor, shape (nnz, num_q_heads * head_size).

  • key (torch.Tensor) – Key tensor, shape (nnz, num_k_heads * head_size).

  • head_size (int) – Size of each attention head.

  • cos_sin_cache (torch.Tensor) – Precomputed cos/sin tensor, shape (max_seq_len, rotary_dim). The first half of rotary_dim stores cosine values, the second half stores sine values.

  • inplace (bool) – If True, apply in-place and return None.

  • is_neox (bool) – If True (default), use GPT-NeoX style (rotate first/second half dims). If False, use interleaved style (rotate even/odd dims).

Returns:

None when inplace is True, otherwise (query_out, key_out) with the same shapes as the inputs.

Return type:

Optional[Tuple[torch.Tensor, torch.Tensor]]
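A cache in this layout can be built as follows. This is a sketch assuming the half-cos / half-sin row layout described above; `build_cos_sin_cache` is a hypothetical helper name.

```python
import numpy as np

def build_cos_sin_cache(max_seq_len, rotary_dim, theta=10000.0):
    """Row p holds [cos(p * inv_freq), sin(p * inv_freq)], giving the
    assumed (max_seq_len, rotary_dim) layout: cosines then sines."""
    inv_freq = 1.0 / theta ** (np.arange(rotary_dim // 2) / (rotary_dim // 2))
    angles = np.outer(np.arange(max_seq_len), inv_freq)  # (max_seq_len, rotary_dim // 2)
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=-1)

cache = build_cos_sin_cache(4096, 128)
# Row 0 is position 0: cosines are all 1, sines are all 0.
```

Precomputing the cache trades memory for kernel-side trig evaluation, which is why this entry point matches the SGL/vLLM calling convention.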

pymllm.layers.rope.apply_mrope(q, k, positions, cos_sin_cache, mrope_section, mrope_interleaved=True)

Apply multi-dimensional rotary position embedding (M-RoPE).

Used by Qwen3-VL, which assigns independent (t, h, w) position indices to each token. For text tokens, all three indices share the same sequential value; for image tokens, they follow the spatial grid layout.

Parameters:
  • q (torch.Tensor) – Query tensor, shape (T, num_q_heads, head_dim).

  • k (torch.Tensor) – Key tensor, shape (T, num_kv_heads, head_dim).

  • positions (torch.Tensor) – 3-D position IDs, shape (3, T) — rows are (temporal, height, width) position indices.

  • cos_sin_cache (torch.Tensor) – Precomputed cache, shape (max_pos, head_dim). The first head_dim // 2 columns are cosine values and the remaining columns are sine values, each for frequencies 0, 1, ..., head_dim // 2 - 1.

  • mrope_section (List[int]) – Three integers [s_t, s_h, s_w] that partition the head_dim // 2 rotary frequency dimensions among the temporal, height, and width components. sum(mrope_section) must equal head_dim // 2.

  • mrope_interleaved (bool) – When True (Qwen3-VL default), uses the interleaved layout where frequency dimensions are cycled (t, h, w, t, h, w, ...) rather than grouped consecutively.

Returns:

(q_rope, k_rope) with the same shapes as the inputs.

Return type:

Tuple[torch.Tensor, torch.Tensor]
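The mrope_section partitioning can be illustrated as follows. This is a sketch of the assumed frequency-to-axis assignment only; the cycled tail behavior when the sections are unequal is an assumption, and the kernel's exact layout may differ.

```python
import numpy as np

# head_dim // 2 = 8 rotary frequencies split as [s_t, s_h, s_w] = [4, 2, 2].
mrope_section = [4, 2, 2]
half = sum(mrope_section)  # must equal head_dim // 2

# Grouped layout (mrope_interleaved=False): contiguous axis blocks.
grouped = np.repeat([0, 1, 2], mrope_section)  # axis index per freq dim
# grouped -> [0, 0, 0, 0, 1, 1, 2, 2]

# Interleaved layout (mrope_interleaved=True): cycle t, h, w until each
# axis exhausts its section budget.
budget, order = list(mrope_section), []
while len(order) < half:
    for axis in range(3):
        if budget[axis] > 0:
            order.append(axis)
            budget[axis] -= 1
interleaved = np.array(order)
# interleaved -> [0, 1, 2, 0, 1, 2, 0, 0]

# A text token with (t, h, w) all equal to p sees the same position on
# every frequency dim, so M-RoPE reduces to standard RoPE for text.
```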