pymllm.orchestrator.group_coordinator¶
GroupCoordinator for distributed communication.
Classes¶
| GroupCoordinator | Manages a group of processes for distributed communication. |
Functions¶
| divide | Divide and ensure divisibility. |
| split_tensor_along_dim | Split tensor along a dimension for tensor parallelism. |
Module Contents¶
- class pymllm.orchestrator.group_coordinator.GroupCoordinator(ranks, local_rank, backend='nccl')¶
Manages a group of processes for distributed communication.
Lightweight wrapper around torch.distributed.ProcessGroup.
- Parameters:
ranks (List[int]) – List of global ranks in this group
local_rank (int) – Local rank for device assignment
backend (str) – Backend to use (nccl, gloo, etc.)
- ranks¶
- local_rank¶
- backend = 'nccl'¶
- world_size¶
- rank_in_group¶
- all_reduce(tensor)¶
All-reduce across the group.
- Parameters:
tensor (torch.Tensor) – Tensor to reduce across all ranks in the group.
- Return type:
torch.Tensor
- all_gather(tensor, dim=0)¶
All-gather across the group.
- Parameters:
tensor (torch.Tensor) – Tensor contributed by this rank.
dim (int) – Dimension along which the gathered tensors are concatenated.
- Return type:
torch.Tensor
- broadcast(tensor, src=0)¶
Broadcast from source rank to all.
- Parameters:
tensor (torch.Tensor) – Tensor to broadcast.
src (int) – Source rank relative to this group (0 <= src < world_size).
- Return type:
torch.Tensor
- pymllm.orchestrator.group_coordinator.divide(numerator, denominator)¶
Divide and ensure divisibility.
- Parameters:
numerator (int) – Value to split.
denominator (int) – Number of parts; must evenly divide numerator.
- Return type:
int
- pymllm.orchestrator.group_coordinator.split_tensor_along_dim(tensor, dim, world_size, rank)¶
Split tensor along a dimension for tensor parallelism.
- Parameters:
tensor (torch.Tensor) – Tensor to split.
dim (int) – Dimension to split along.
world_size (int) – Number of partitions.
rank (int) – Which partition to return (0 <= rank < world_size).
- Return type:
torch.Tensor