pymllm.orchestrator.group_coordinator

GroupCoordinator for distributed communication.

Classes

GroupCoordinator

Manages a group of processes for distributed communication.

Functions

divide(numerator, denominator)

Divide numerator by denominator, failing if the division is not exact.

split_tensor_along_dim(tensor, dim, world_size, rank)

Split tensor along a dimension for tensor parallelism.

Module Contents

class pymllm.orchestrator.group_coordinator.GroupCoordinator(ranks, local_rank, backend='nccl')

Manages a group of processes for distributed communication.

Lightweight wrapper around torch.distributed.ProcessGroup.

Parameters:
  • ranks (List[int]) – List of global ranks in this group

  • local_rank (int) – Local rank for device assignment

  • backend (str) – Backend to use (nccl, gloo, etc.)

ranks – List of global ranks in this group.
local_rank – Local rank used for device assignment.
backend = 'nccl' – Communication backend for the group.
world_size – Number of ranks in the group.
rank_in_group – This process's rank within the group.
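Since the class is described as a lightweight wrapper, its bookkeeping plausibly derives from torch.distributed state. The sketch below is a hypothetical reconstruction, not the actual implementation: the device_group attribute name is assumed, and the usage lines spin up a single-process gloo group purely for illustration.

```python
import os
import tempfile
import torch.distributed as dist

class GroupCoordinator:
    """Sketch of the wrapper, assuming it builds its process group
    via dist.new_group (hypothetical reconstruction)."""

    def __init__(self, ranks, local_rank, backend="nccl"):
        self.ranks = ranks
        self.local_rank = local_rank
        self.backend = backend
        self.world_size = len(ranks)
        # new_group must be called collectively, with identical
        # arguments, on every process in the default group.
        self.device_group = dist.new_group(ranks, backend=backend)
        # Position of this process's global rank inside the group.
        self.rank_in_group = self.ranks.index(dist.get_rank())

# Illustration only: a single-process gloo group via a file store.
init_file = os.path.join(tempfile.mkdtemp(), "pg_init")
dist.init_process_group("gloo", init_method=f"file://{init_file}",
                        rank=0, world_size=1)
coord = GroupCoordinator(ranks=[0], local_rank=0, backend="gloo")
```

A real deployment would initialize one process per device and pass each process's global ranks into the constructor.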
all_reduce(tensor)

All-reduce the tensor across all ranks in the group.

Parameters:

tensor (torch.Tensor) – Tensor to reduce across all ranks in the group.

Return type:

torch.Tensor
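The method presumably forwards to torch.distributed.all_reduce on the group's ProcessGroup; the default reduction op there is ReduceOp.SUM. A minimal sketch of the underlying collective, using a single-process gloo group so the sum over one rank leaves the tensor unchanged:

```python
import os
import tempfile
import torch
import torch.distributed as dist

# Single-process group for illustration; the call path is the
# real collective even though there is only one contributor.
init_file = os.path.join(tempfile.mkdtemp(), "pg_init")
dist.init_process_group("gloo", init_method=f"file://{init_file}",
                        rank=0, world_size=1)

t = torch.tensor([1.0, 2.0, 3.0])
dist.all_reduce(t)  # default op is ReduceOp.SUM; reduces in place
```

With N ranks each contributing this tensor, every rank would end up holding the elementwise sum of all N copies.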

all_gather(tensor, dim=0)

All-gather tensors from every rank and concatenate them along dim.

Parameters:
  • tensor (torch.Tensor) – Tensor to gather from every rank.

  • dim (int) – Dimension along which the gathered tensors are concatenated.

Return type:

torch.Tensor
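This plausibly wraps torch.distributed.all_gather, which fills a list of per-rank buffers; the wrapper's dim argument would then control the final concatenation. A single-process gloo sketch (illustration only, not the library's code):

```python
import os
import tempfile
import torch
import torch.distributed as dist

init_file = os.path.join(tempfile.mkdtemp(), "pg_init")
dist.init_process_group("gloo", init_method=f"file://{init_file}",
                        rank=0, world_size=1)

local = torch.tensor([[0.5, 1.5]])
# One pre-allocated buffer per rank in the group.
buf = [torch.empty_like(local) for _ in range(dist.get_world_size())]
dist.all_gather(buf, local)
# Stitch the per-rank shards back together along `dim`.
gathered = torch.cat(buf, dim=0)
```

With N ranks, gathered would have N times the size of the input along the chosen dimension.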

broadcast(tensor, src=0)

Broadcast from source rank to all.

Parameters:
  • tensor (torch.Tensor) – Tensor to broadcast.

  • src (int) – Source rank relative to this group (0 <= src < world_size).

Return type:

torch.Tensor
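Note that torch.distributed.broadcast takes a global rank as src, while the parameter here is documented as group-relative; the wrapper presumably translates via something like self.ranks[src]. A single-process gloo sketch of the underlying collective (illustration only):

```python
import os
import tempfile
import torch
import torch.distributed as dist

init_file = os.path.join(tempfile.mkdtemp(), "pg_init")
dist.init_process_group("gloo", init_method=f"file://{init_file}",
                        rank=0, world_size=1)

t = torch.zeros(3)
if dist.get_rank() == 0:
    # Only the source rank holds the real payload beforehand.
    t += torch.tensor([1.0, 2.0, 3.0])
dist.broadcast(t, src=0)  # every rank now holds the source's values
```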

pymllm.orchestrator.group_coordinator.divide(numerator, denominator)

Divide numerator by denominator, failing if the division is not exact.

Parameters:
  • numerator (int)

  • denominator (int)

Return type:

int
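A helper like this typically guards sharding math, e.g. splitting a hidden size across tensor-parallel ranks. A minimal sketch consistent with the description (the actual error type is an assumption):

```python
def divide(numerator: int, denominator: int) -> int:
    """Exact integer division; fails loudly on a remainder."""
    if numerator % denominator != 0:
        raise ValueError(
            f"{numerator} is not evenly divisible by {denominator}")
    return numerator // denominator
```

For example, divide(4096, 8) returns 512, while divide(4096, 3) raises instead of silently truncating a shard.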

pymllm.orchestrator.group_coordinator.split_tensor_along_dim(tensor, dim, world_size, rank)

Split tensor along a dimension for tensor parallelism.

Parameters:
  • tensor (torch.Tensor) – Tensor to split.

  • dim (int) – Dimension along which to split.

  • world_size (int) – Number of equal shards.

  • rank (int) – Index of the shard to return (0 <= rank < world_size).

Return type:

torch.Tensor
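One plausible implementation, assuming even sharding (the exact divisibility handling in the library is not shown here):

```python
import torch

def split_tensor_along_dim(tensor: torch.Tensor, dim: int,
                           world_size: int, rank: int) -> torch.Tensor:
    """Return this rank's contiguous shard of `tensor` along `dim`."""
    size = tensor.size(dim)
    # Assumed behavior: the dimension must shard evenly across ranks.
    assert size % world_size == 0, "dim must divide evenly by world_size"
    shard = size // world_size
    # narrow() is a view; contiguous() copies so collectives can use it.
    return tensor.narrow(dim, rank * shard, shard).contiguous()

x = torch.arange(8).reshape(2, 4)
mine = split_tensor_along_dim(x, dim=1, world_size=2, rank=1)
```

Here rank 1 of 2 gets the last two columns of the 2x4 input, i.e. [[2, 3], [6, 7]].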