_benchmark_tile_reduce_single() — PyTorch Function Reference
Architecture documentation for the _benchmark_tile_reduce_single() function in bench_nvshmem_tile_reduce.py from the PyTorch codebase.
Entity Profile
Dependency Diagram
graph TD
    1fb81840_0784_4f41_94ba_38ef34fe84e5["_benchmark_tile_reduce_single()"]
    2b246b4f_aea8_6874_ed04_0bed1d42b788["test_benchmark_tile_reduce_various_sizes()"]
    2b246b4f_aea8_6874_ed04_0bed1d42b788 -->|calls| 1fb81840_0784_4f41_94ba_38ef34fe84e5
    89ca7f4b_c6a3_7849_81c9_ba6dfa92cd70["_init_device()"]
    1fb81840_0784_4f41_94ba_38ef34fe84e5 -->|calls| 89ca7f4b_c6a3_7849_81c9_ba6dfa92cd70
    style 1fb81840_0784_4f41_94ba_38ef34fe84e5 fill:#6366f1,stroke:#818cf8,color:#fff
Source Code
benchmarks/distributed/bench_nvshmem_tile_reduce.py lines 51–130
def _benchmark_tile_reduce_single(
    self,
    full_size: int,
    tile_size: int,
    warmup_iters: int = 5,
    bench_iters: int = 10,
) -> dict:
    """
    Benchmark a single configuration of tile reduce.

    Args:
        full_size: Size of the full matrix (full_size x full_size)
        tile_size: Size of the tile to reduce (tile_size x tile_size)
        warmup_iters: Number of warmup iterations
        bench_iters: Number of benchmark iterations

    Returns:
        Dictionary with benchmark results
    """
    self._init_device()
    group_name = dist.group.WORLD.group_name
    dtype = torch.float

    # Allocate full matrices
    full_inp = symm_mem.empty(
        full_size, full_size, dtype=dtype, device=self.device
    ).fill_(self.rank)
    full_out = symm_mem.empty(
        full_size, full_size, dtype=dtype, device=self.device
    ).fill_(0)

    slice_ut = slice(0, tile_size)
    inp_tile = full_inp[slice_ut, slice_ut]
    out_tile = full_out[slice_ut, slice_ut]
    root = 0

    # Warmup iterations
    for _ in range(warmup_iters):
        torch.ops.symm_mem.tile_reduce(inp_tile, out_tile, root, group_name)
    torch.cuda.synchronize(self.device)

    # Benchmark iterations
    times = []
    dist.barrier()
    torch.cuda.synchronize(self.device)
    start_time = time.perf_counter()
    for _ in range(bench_iters):
        torch.ops.symm_mem.tile_reduce(inp_tile, out_tile, root, group_name)
    torch.cuda.synchronize(self.device)
    end_time = time.perf_counter()
    times.append((end_time - start_time) / bench_iters)

    # Calculate statistics
    times = torch.tensor(times, dtype=torch.float64)
    tile_elements = tile_size * tile_size
    tile_bytes = (
        tile_elements * dtype.itemsize
        if hasattr(dtype, "itemsize")
        else tile_elements * 4
    )

    results = {
        "full_size": full_size,
        "tile_size": tile_size,
        "tile_elements": tile_elements,
        "tile_bytes": tile_bytes,
        "world_size": self.world_size,
        "mean_time_ms": times.mean().item() * 1000,
        "std_time_ms": times.std().item() * 1000,
        "min_time_ms": times.min().item() * 1000,
        "max_time_ms": times.max().item() * 1000,
        "throughput_gb_s": tile_bytes / (times.mean().item() * 1e9),
        "elements_per_sec": tile_elements / times.mean().item(),
    }
    return results
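The statistics block at the end of the function can be reproduced in isolation. The sketch below substitutes hypothetical timing values for real measurements and the standard-library statistics module for torch (statistics.stdev computes the same sample standard deviation that torch.Tensor.std uses by default); tile_size and the timings are illustrative assumptions, not values from the benchmark:

```python
import statistics

# Hypothetical per-iteration timings in seconds, standing in for the
# values the benchmark loop appends to `times`.
times = [0.0010, 0.0012, 0.0011]

tile_size = 512                        # illustrative tile edge length
tile_elements = tile_size * tile_size
tile_bytes = tile_elements * 4         # torch.float is 4 bytes per element

mean_s = statistics.mean(times)
stats = {
    "mean_time_ms": mean_s * 1000,
    "std_time_ms": statistics.stdev(times) * 1000,  # sample std dev
    "min_time_ms": min(times) * 1000,
    "max_time_ms": max(times) * 1000,
    # bytes moved per second, scaled to GB/s: bytes / (seconds * 1e9)
    "throughput_gb_s": tile_bytes / (mean_s * 1e9),
    "elements_per_sec": tile_elements / mean_s,
}
```

Note that with the averaging scheme in the function, `times` holds a single entry per call, so the reported std is only meaningful when results from multiple calls are aggregated.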
Frequently Asked Questions
What does _benchmark_tile_reduce_single() do?
_benchmark_tile_reduce_single() benchmarks one tile-reduce configuration: it allocates symmetric-memory matrices of size full_size x full_size, runs torch.ops.symm_mem.tile_reduce on a tile_size x tile_size tile for warmup and timed iterations, and returns a dictionary of timing statistics (mean/std/min/max in milliseconds) plus derived throughput in GB/s and elements per second.
What does _benchmark_tile_reduce_single() call?
_benchmark_tile_reduce_single() calls 1 function(s): _init_device.
What calls _benchmark_tile_reduce_single()?
_benchmark_tile_reduce_single() is called by 1 function(s): test_benchmark_tile_reduce_various_sizes.
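The warmup-then-measure structure the function follows is a general benchmarking pattern and is not specific to CUDA or NVSHMEM. Below is a minimal, CPU-only sketch of that pattern; the bench helper and the dummy workload are illustrative stand-ins, not part of the benchmark file:

```python
import time

def bench(fn, warmup_iters=5, bench_iters=10):
    """Warmup-then-measure timing pattern.

    Runs `fn` a few times untimed so caches and lazy initialization
    settle, then times a batch of iterations with the high-resolution
    monotonic clock and returns mean seconds per call.
    """
    for _ in range(warmup_iters):
        fn()
    start = time.perf_counter()
    for _ in range(bench_iters):
        fn()
    end = time.perf_counter()
    return (end - start) / bench_iters

# Time a trivial CPU workload as a stand-in for the tile_reduce op.
mean_s = bench(lambda: sum(range(1000)))
```

In the real function, torch.cuda.synchronize() is required before reading the clock because CUDA kernels launch asynchronously; for CPU work no such barrier is needed.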