pymllm.layers.rope

Functions

apply_rope(q, k, indptr, offsets[, inplace, ...])

Apply rotary embedding to a batch of queries/keys (stored as RaggedTensor).

apply_llama31_rope(q, k, indptr, offsets[, inplace, ...])

Apply Llama 3.1 style rotary embedding to a batch of queries/keys.

apply_rope_pos_ids(q, k, pos_ids[, inplace, ...])

Apply rotary embedding using explicit per-token position IDs.

apply_llama31_rope_pos_ids(q, k, pos_ids[, inplace, ...])

Apply Llama 3.1 style RoPE using explicit per-token position IDs.

apply_rope_with_cos_sin_cache(positions, query, key, ...)

Apply rotary embedding with precomputed cos/sin cache.

apply_mrope(q, k, positions, cos_sin_cache, mrope_section)

Apply multi-dimensional rotary position embedding (M-RoPE).

Module Contents

pymllm.layers.rope.apply_rope(q, k, indptr, offsets, inplace=False, rotary_dim=None, interleave=False, rope_scale=1.0, rope_theta=10000.0)

Apply rotary embedding to a batch of queries/keys (stored as RaggedTensor).

The cos/sin values are computed on the fly inside the kernel; position offsets are provided per segment via indptr and offsets.

Parameters:
  • q (torch.Tensor) – Query ragged tensor, shape (nnz, num_q_heads, head_dim).

  • k (torch.Tensor) – Key ragged tensor, shape (nnz, num_k_heads, head_dim).

  • indptr (torch.Tensor) – Indptr tensor, shape (batch_size + 1,). The i-th segment spans q[indptr[i]:indptr[i+1]].

  • offsets (torch.Tensor) – Relative position offsets per segment, shape (batch_size,).

  • inplace (bool) – If True, apply RoPE in-place and return None. If False, return new (q_rope, k_rope) tensors.

  • rotary_dim (Optional[int]) – Number of dimensions to apply RoPE to. None means the entire head_dim.

  • interleave (bool) – If True, rotate even/odd dims ([..., ::2] / [..., 1::2]). If False, rotate first/second half dims.

  • rope_scale (float) – Scaling factor for position indices.

  • rope_theta (float) – Base frequency theta.

Returns:

None when inplace is True, otherwise a tuple (q_rope, k_rope) of rotated tensors with the same shapes as the inputs.

Return type:

Optional[Tuple[torch.Tensor, torch.Tensor]]
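The per-segment position semantics and the non-interleaved rotation can be sketched in NumPy. This is a reference sketch of the assumed math, not the actual kernel; `ref_rope` is a hypothetical helper name, and dividing positions by `rope_scale` follows the usual linear-scaling convention.

```python
import numpy as np

def ref_rope(x, pos_ids, rope_theta=10000.0, rope_scale=1.0):
    """Reference RoPE, non-interleaved layout: rotate the first and
    second halves of head_dim. x has shape (nnz, num_heads, head_dim)."""
    head_dim = x.shape[-1]
    half = head_dim // 2
    # Inverse frequencies theta^(-i / (head_dim/2)) for i in 0..half-1.
    inv_freq = 1.0 / rope_theta ** (np.arange(half) / half)
    angles = np.einsum("t,d->td", pos_ids / rope_scale, inv_freq)  # (nnz, half)
    cos = np.cos(angles)[:, None, :]  # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Positions implied by (indptr, offsets): token j of segment i gets
# position offsets[i] + (j - indptr[i]).
indptr = np.array([0, 3, 5])
offsets = np.array([10, 0])
pos_ids = np.concatenate(
    [offsets[i] + np.arange(indptr[i + 1] - indptr[i])
     for i in range(len(offsets))])
# pos_ids -> [10, 11, 12, 0, 1]
```

Because each pair of dims is rotated by a plane rotation, the transform preserves per-token vector norms.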

pymllm.layers.rope.apply_llama31_rope(q, k, indptr, offsets, inplace=False, rotary_dim=None, interleave=False, rope_scale=8.0, rope_theta=500000.0, low_freq_factor=1.0, high_freq_factor=4.0, old_context_len=8192)

Apply Llama 3.1 style rotary embedding to a batch of queries/keys.

This variant adjusts frequencies with low_freq_factor, high_freq_factor, and old_context_len following the Llama 3.1 RoPE recipe; cos/sin values are computed on the fly.

Parameters:
  • q (torch.Tensor) – Query ragged tensor, shape (nnz, num_q_heads, head_dim).

  • k (torch.Tensor) – Key ragged tensor, shape (nnz, num_k_heads, head_dim).

  • indptr (torch.Tensor) – Indptr tensor, shape (batch_size + 1,).

  • offsets (torch.Tensor) – Relative position offsets per segment, shape (batch_size,).

  • inplace (bool) – If True, apply in-place and return None.

  • rotary_dim (Optional[int]) – Number of dimensions to apply RoPE to. None means the entire head_dim.

  • interleave (bool) – If True, rotate even/odd dims; otherwise first/second half dims.

  • rope_scale (float) – Scaling factor for position indices (default 8).

  • rope_theta (float) – Base frequency theta (default 5e5).

  • low_freq_factor (float) – Low frequency factor for Llama 3.1 RoPE.

  • high_freq_factor (float) – High frequency factor for Llama 3.1 RoPE.

  • old_context_len (int) – Original context length for Llama 3.1 RoPE.

Returns:

None when inplace is True, otherwise (q_rope, k_rope).

Return type:

Optional[Tuple[torch.Tensor, torch.Tensor]]
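The band-wise frequency adjustment can be sketched as follows. This is a reference sketch of the published Llama 3.1 recipe under the defaults above; the kernel computes it on the fly, and `llama31_adjust_inv_freq` is a hypothetical name.

```python
import numpy as np

def llama31_adjust_inv_freq(inv_freq, rope_scale=8.0,
                            low_freq_factor=1.0, high_freq_factor=4.0,
                            old_context_len=8192):
    """Llama 3.1 RoPE adjustment: keep high frequencies, divide low
    frequencies by rope_scale, and smoothly interpolate in between."""
    wavelen = 2 * np.pi / inv_freq
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    # Interpolation factor for the mid band (masked out elsewhere).
    smooth = (old_context_len / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor)
    return np.where(
        wavelen < high_freq_wavelen, inv_freq,            # high freq: keep
        np.where(wavelen > low_freq_wavelen,
                 inv_freq / rope_scale,                   # low freq: scale down
                 (1 - smooth) * inv_freq / rope_scale + smooth * inv_freq))

inv_freq = 1.0 / 500000.0 ** (np.arange(64) / 64)
adj = llama31_adjust_inv_freq(inv_freq)
```

Every adjusted frequency is at most the original one, so the effect is a band-dependent stretching of wavelengths.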

pymllm.layers.rope.apply_rope_pos_ids(q, k, pos_ids, inplace=False, rotary_dim=None, interleave=False, rope_scale=1.0, rope_theta=10000.0)

Apply rotary embedding using explicit per-token position IDs.

Unlike apply_rope(), which derives positions from indptr / offsets, this function takes a flat pos_ids tensor that supplies an explicit position for every token.

Parameters:
  • q (torch.Tensor) – Query tensor, shape (nnz, num_q_heads, head_dim).

  • k (torch.Tensor) – Key tensor, shape (nnz, num_k_heads, head_dim).

  • pos_ids (torch.Tensor) – Position indices, shape (nnz,).

  • inplace (bool) – If True, apply in-place and return None.

  • rotary_dim (Optional[int]) – Number of dimensions to apply RoPE to.

  • interleave (bool) – Interleaved layout flag.

  • rope_scale (float) – Scaling factor for position indices.

  • rope_theta (float) – Base frequency theta.

Returns:

None when inplace is True, otherwise (q_rope, k_rope).

Return type:

Optional[Tuple[torch.Tensor, torch.Tensor]]
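Explicit positions are useful when packed sequences do not start at position zero, e.g. when resuming decode with a prefix already in the KV cache. A small sketch with illustrative values (the commented call shows where the hypothetical tensors would be passed):

```python
import numpy as np

# Two packed sequences of lengths 3 and 2; the second resumes at
# position 7 (e.g. 7 prompt tokens are already in the KV cache).
seq_lens = [3, 2]
start_pos = [0, 7]
pos_ids = np.concatenate(
    [s + np.arange(n) for s, n in zip(start_pos, seq_lens)])
# pos_ids -> [0, 1, 2, 7, 8]; supplied per token instead of (indptr, offsets):
# q_rope, k_rope = apply_rope_pos_ids(q, k, torch.from_numpy(pos_ids))
```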

pymllm.layers.rope.apply_llama31_rope_pos_ids(q, k, pos_ids, inplace=False, rotary_dim=None, interleave=False, rope_scale=8.0, rope_theta=500000.0, low_freq_factor=1.0, high_freq_factor=4.0, old_context_len=8192)

Apply Llama 3.1 style RoPE using explicit per-token position IDs.

Combines Llama 3.1 frequency adjustments with explicit pos_ids.

Parameters:
  • q (torch.Tensor) – Query tensor, shape (nnz, num_q_heads, head_dim).

  • k (torch.Tensor) – Key tensor, shape (nnz, num_k_heads, head_dim).

  • pos_ids (torch.Tensor) – Position indices, shape (nnz,).

  • inplace (bool) – If True, apply in-place and return None.

  • rotary_dim (Optional[int]) – Number of dimensions to apply RoPE to.

  • interleave (bool) – Interleaved layout flag.

  • rope_scale (float) – Scaling factor (default 8).

  • rope_theta (float) – Base frequency theta (default 5e5).

  • low_freq_factor (float) – Low frequency factor for Llama 3.1 RoPE.

  • high_freq_factor (float) – High frequency factor for Llama 3.1 RoPE.

  • old_context_len (int) – Original context length for Llama 3.1 RoPE.

Returns:

None when inplace is True, otherwise (q_rope, k_rope).

Return type:

Optional[Tuple[torch.Tensor, torch.Tensor]]

pymllm.layers.rope.apply_rope_with_cos_sin_cache(positions, query, key, head_size, cos_sin_cache, inplace=False, is_neox=True)

Apply rotary embedding with precomputed cos/sin cache.

Compatible with SGL/vLLM implementations. Note that query and key use a flattened head layout (nnz, num_heads * head_size) instead of the 3-D layout used by the other apply_rope* functions.

Parameters:
  • positions (torch.Tensor) – Position indices, shape (nnz,).

  • query (torch.Tensor) – Query tensor, shape (nnz, num_q_heads * head_size).

  • key (torch.Tensor) – Key tensor, shape (nnz, num_k_heads * head_size).

  • head_size (int) – Size of each attention head.

  • cos_sin_cache (torch.Tensor) – Precomputed cos/sin tensor, shape (max_seq_len, rotary_dim). The first half of rotary_dim stores cosine values, the second half stores sine values.

  • inplace (bool) – If True, apply in-place and return None.

  • is_neox (bool) – If True (default), use GPT-NeoX style (rotate first/second half dims). If False, use interleaved style (rotate even/odd dims).

Returns:

None when inplace is True, otherwise (query_out, key_out) with the same shapes as the inputs.

Return type:

Optional[Tuple[torch.Tensor, torch.Tensor]]
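A cache in this layout can be built as follows. This is a sketch assuming the half-cos / half-sin row layout described above; `build_cos_sin_cache` is a hypothetical helper name.

```python
import numpy as np

def build_cos_sin_cache(max_seq_len, rotary_dim, theta=10000.0):
    """Row p holds [cos(p * inv_freq), sin(p * inv_freq)], giving the
    assumed (max_seq_len, rotary_dim) layout: cosines then sines."""
    inv_freq = 1.0 / theta ** (np.arange(rotary_dim // 2) / (rotary_dim // 2))
    angles = np.outer(np.arange(max_seq_len), inv_freq)  # (max_seq_len, rotary_dim // 2)
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=-1)

cache = build_cos_sin_cache(4096, 128)
# Row 0 is position 0: cosines are all 1, sines are all 0.
```

Precomputing the cache trades memory for kernel-side trig evaluation, which is why this entry point matches the SGL/vLLM calling convention.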

pymllm.layers.rope.apply_mrope(q, k, positions, cos_sin_cache, mrope_section, mrope_interleaved=True)

Apply multi-dimensional rotary position embedding (M-RoPE).

Used by Qwen3-VL, which assigns independent (t, h, w) position indices to each token. For text tokens, all three indices share the same sequential value; for image tokens, they follow the spatial grid layout.

Parameters:
  • q (torch.Tensor) – Query tensor, shape (T, num_q_heads, head_dim).

  • k (torch.Tensor) – Key tensor, shape (T, num_kv_heads, head_dim).

  • positions (torch.Tensor) – 3-D position IDs, shape (3, T) — rows are (temporal, height, width) position indices.

  • cos_sin_cache (torch.Tensor) – Precomputed cache, shape (max_pos, head_dim). The first head_dim // 2 columns are cosine values and the remaining columns are sine values, each for frequencies 0, 1, ..., head_dim // 2 - 1.

  • mrope_section (List[int]) – Three integers [s_t, s_h, s_w] that partition the head_dim // 2 rotary frequency dimensions among the temporal, height, and width components. sum(mrope_section) must equal head_dim // 2.

  • mrope_interleaved (bool) – When True (Qwen3-VL default), uses the interleaved layout where frequency dimensions are cycled (t, h, w, t, h, w, ...) rather than grouped consecutively.

Returns:

(q_rope, k_rope) with the same shapes as the inputs.

Return type:

Tuple[torch.Tensor, torch.Tensor]
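The mrope_section partitioning can be illustrated as follows. This is a sketch of the assumed frequency-to-axis assignment only; the cycled tail behavior when the sections are unequal is an assumption, and the kernel's exact layout may differ.

```python
import numpy as np

# head_dim // 2 = 8 rotary frequencies split as [s_t, s_h, s_w] = [4, 2, 2].
mrope_section = [4, 2, 2]
half = sum(mrope_section)  # must equal head_dim // 2

# Grouped layout (mrope_interleaved=False): contiguous axis blocks.
grouped = np.repeat([0, 1, 2], mrope_section)  # axis index per freq dim
# grouped -> [0, 0, 0, 0, 1, 1, 2, 2]

# Interleaved layout (mrope_interleaved=True): cycle t, h, w until each
# axis exhausts its section budget.
budget, order = list(mrope_section), []
while len(order) < half:
    for axis in range(3):
        if budget[axis] > 0:
            order.append(axis)
            budget[axis] -= 1
interleaved = np.array(order)
# interleaved -> [0, 1, 2, 0, 1, 2, 0, 0]

# A text token with (t, h, w) all equal to p sees the same position on
# every frequency dim, so M-RoPE reduces to standard RoPE for text.
```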