_benchmark_tile_reduce_single() — PyTorch Function Reference
Architecture documentation for the _benchmark_tile_reduce_single() function in bench_nvshmem_tile_reduce.py from the PyTorch codebase.
Entity Profile
Dependency Diagram
graph TD
    1fb81840_0784_4f41_94ba_38ef34fe84e5["_benchmark_tile_reduce_single()"]
    2b246b4f_aea8_6874_ed04_0bed1d42b788["test_benchmark_tile_reduce_various_sizes()"]
    2b246b4f_aea8_6874_ed04_0bed1d42b788 -->|calls| 1fb81840_0784_4f41_94ba_38ef34fe84e5
    89ca7f4b_c6a3_7849_81c9_ba6dfa92cd70["_init_device()"]
    1fb81840_0784_4f41_94ba_38ef34fe84e5 -->|calls| 89ca7f4b_c6a3_7849_81c9_ba6dfa92cd70
    style 1fb81840_0784_4f41_94ba_38ef34fe84e5 fill:#6366f1,stroke:#818cf8,color:#fff
Source Code
benchmarks/distributed/bench_nvshmem_tile_reduce.py lines 51–130
def _benchmark_tile_reduce_single(
    self,
    full_size: int,
    tile_size: int,
    warmup_iters: int = 5,
    bench_iters: int = 10,
) -> dict:
    """
    Benchmark a single configuration of tile reduce.

    Args:
        full_size: Size of the full matrix (full_size x full_size)
        tile_size: Size of the tile to reduce (tile_size x tile_size)
        warmup_iters: Number of warmup iterations
        bench_iters: Number of benchmark iterations

    Returns:
        Dictionary with benchmark results
    """
    self._init_device()
    group_name = dist.group.WORLD.group_name
    dtype = torch.float

    # Allocate full matrices
    full_inp = symm_mem.empty(
        full_size, full_size, dtype=dtype, device=self.device
    ).fill_(self.rank)
    full_out = symm_mem.empty(
        full_size, full_size, dtype=dtype, device=self.device
    ).fill_(0)

    slice_ut = slice(0, tile_size)
    inp_tile = full_inp[slice_ut, slice_ut]
    out_tile = full_out[slice_ut, slice_ut]
    root = 0

    # Warmup iterations
    for _ in range(warmup_iters):
        torch.ops.symm_mem.tile_reduce(inp_tile, out_tile, root, group_name)
    torch.cuda.synchronize(self.device)

    # Benchmark iterations
    times = []
    dist.barrier()
    torch.cuda.synchronize(self.device)
    start_time = time.perf_counter()
    for _ in range(bench_iters):
        torch.ops.symm_mem.tile_reduce(inp_tile, out_tile, root, group_name)
    torch.cuda.synchronize(self.device)
    end_time = time.perf_counter()
    times.append((end_time - start_time) / bench_iters)

    # Calculate statistics
    times = torch.tensor(times, dtype=torch.float64)
    tile_elements = tile_size * tile_size
    tile_bytes = (
        tile_elements * dtype.itemsize
        if hasattr(dtype, "itemsize")
        else tile_elements * 4
    )

    results = {
        "full_size": full_size,
        "tile_size": tile_size,
        "tile_elements": tile_elements,
        "tile_bytes": tile_bytes,
        "world_size": self.world_size,
        "mean_time_ms": times.mean().item() * 1000,
        "std_time_ms": times.std().item() * 1000,
        "min_time_ms": times.min().item() * 1000,
        "max_time_ms": times.max().item() * 1000,
        "throughput_gb_s": tile_bytes / (times.mean().item() * 1e9),
        "elements_per_sec": tile_elements / times.mean().item(),
    }
    return results
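The statistics block at the end of the function can be reproduced in isolation. The sketch below substitutes hypothetical timing values for real measurements and the standard-library statistics module for torch (statistics.stdev computes the same sample standard deviation that torch.Tensor.std uses by default); tile_size and the timings are illustrative assumptions, not values from the benchmark:

```python
import statistics

# Hypothetical per-iteration timings in seconds, standing in for the
# values the benchmark loop appends to `times`.
times = [0.0010, 0.0012, 0.0011]

tile_size = 512                        # illustrative tile edge length
tile_elements = tile_size * tile_size
tile_bytes = tile_elements * 4         # torch.float is 4 bytes per element

mean_s = statistics.mean(times)
stats = {
    "mean_time_ms": mean_s * 1000,
    "std_time_ms": statistics.stdev(times) * 1000,  # sample std dev
    "min_time_ms": min(times) * 1000,
    "max_time_ms": max(times) * 1000,
    # bytes moved per second, scaled to GB/s: bytes / (seconds * 1e9)
    "throughput_gb_s": tile_bytes / (mean_s * 1e9),
    "elements_per_sec": tile_elements / mean_s,
}
```

Note that with the averaging scheme in the function, `times` holds a single entry per call, so the reported std is only meaningful when results from multiple calls are aggregated.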
Frequently Asked Questions
What does _benchmark_tile_reduce_single() do?
_benchmark_tile_reduce_single() benchmarks one tile-reduce configuration: it allocates symmetric-memory matrices of size full_size x full_size, runs torch.ops.symm_mem.tile_reduce on a tile_size x tile_size tile for warmup and timed iterations, and returns a dictionary of timing statistics (mean/std/min/max in milliseconds) plus derived throughput in GB/s and elements per second.
What does _benchmark_tile_reduce_single() call?
_benchmark_tile_reduce_single() calls 1 function(s): _init_device.
What calls _benchmark_tile_reduce_single()?
_benchmark_tile_reduce_single() is called by 1 function(s): test_benchmark_tile_reduce_various_sizes.
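The warmup-then-measure structure the function follows is a general benchmarking pattern and is not specific to CUDA or NVSHMEM. Below is a minimal, CPU-only sketch of that pattern; the bench helper and the dummy workload are illustrative stand-ins, not part of the benchmark file:

```python
import time

def bench(fn, warmup_iters=5, bench_iters=10):
    """Warmup-then-measure timing pattern.

    Runs `fn` a few times untimed so caches and lazy initialization
    settle, then times a batch of iterations with the high-resolution
    monotonic clock and returns mean seconds per call.
    """
    for _ in range(warmup_iters):
        fn()
    start = time.perf_counter()
    for _ in range(bench_iters):
        fn()
    end = time.perf_counter()
    return (end - start) / bench_iters

# Time a trivial CPU workload as a stand-in for the tile_reduce op.
mean_s = bench(lambda: sum(range(1000)))
```

In the real function, torch.cuda.synchronize() is required before reading the clock because CUDA kernels launch asynchronously; for CPU work no such barrier is needed.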