reduce_sparse_csr_dim0_cpu_template Function — pytorch Architecture
Architecture documentation for the reduce_sparse_csr_dim0_cpu_template function template in SparseCsrTensorMath.cpp from the pytorch codebase.
Entity Profile
Source Code
aten/src/ATen/native/sparse/SparseCsrTensorMath.cpp lines 1036–1120
template <typename scalar_t, typename ReductionOp>
Tensor reduce_sparse_csr_dim0_cpu_template(const Tensor& sparse, ReductionOp rop) {
/*
Consider the following sparse tensor:
1 * * * *
* * * 2 *
* * 3 * *
* * * * *
4 * 5 * *
that has CSR representation
crow_indices = [0, 1, 2, 3, 3, 5]
col_indices = [0, 3, 2, 0, 2]
values = [1, 2, 3, 4, 5]
Reduction with dim=0 results in:
rop(1,4) * rop(3,5) 2 *
that has CSR representation
new_crow_indices = [0, 3]
new_col_indices = [0, 2, 3]
new_values = [rop(1, 4), rop(3, 5), 2]
In general, the CSR representation data can be computed as follows:
new_col_indices, col_map = col_indices.unique(sorted=True, return_inverse=True)
nnz = new_col_indices.numel()
new_crow_indices = [0, nnz]
new_values.resize(nnz); new_values.fill_(identity)
for i in range(col_indices.numel()):
new_values[col_map[i]] = rop(new_values[col_map[i]], values[i])
*/
Tensor col_indices = sparse.col_indices();
Tensor values = sparse.values();
auto numel = values.numel();
/*
Calling at::_unique constitutes the main bottleneck of this
function. However, it is still about 5x faster than using the
invariant:
csr.sum(dim=0) == csr.transpose(0, 1).sum(dim=1)
*/
auto [new_col_indices, columns_map] = at::_unique(col_indices, true, true);
auto nnz = new_col_indices.numel();
Tensor new_crow_indices = at::empty({2}, col_indices.options());
new_crow_indices[0] = 0;
new_crow_indices[1] = nnz;
// Pass `is_cuda` = `true` to acc_type even on the CPU backend, because the
// accumulation type of float should stay float in this scenario: on CUDA, float
// accumulates as float, while on CPU, float would accumulate as double.
using acc_t = at::acc_type<scalar_t, true>;
auto acc_buffer = at::sparse_csr::create_acc_buffer<acc_t, scalar_t>(
values.options(), values.scalar_type(), nnz);
Tensor new_values = std::get<0>(acc_buffer);
Tensor new_values_acc = std::get<1>(acc_buffer);
new_values_acc.fill_(rop.identity());
int64_t* columns_map_ptr = columns_map.data_ptr<int64_t>();
scalar_t* values_ptr = values.data_ptr<scalar_t>();
acc_t* new_values_acc_ptr =
new_values_acc.data_ptr<acc_t>();
// There is no point in parallelizing the following for-loop
// because about 99.3% of the computation time is spent in the
// at::_unique call above.
for (const auto i : c10::irange(numel)) {
int64_t col = columns_map_ptr[i];
scalar_t val = values_ptr[i];
new_values_acc_ptr[col] = rop(new_values_acc_ptr[col], static_cast<acc_t>(val));
}
copy_from_acc_buffer(new_values, new_values_acc);
return at::native::_sparse_csr_tensor_unsafe(new_crow_indices, new_col_indices, new_values,
{1, sparse.size(1)},
new_values.scalar_type(),
sparse.layout(),
new_values.device());
}
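The reduction scheme described in the function's doc comment can be sketched in pure Python. This is a minimal illustration of the algorithm, not the ATen implementation: the real code obtains the column map from at::_unique and accumulates through an acc_type buffer, while here a plain dict and list stand in for both (all helper names below are hypothetical).

```python
def reduce_csr_dim0(col_indices, values, rop, identity):
    """Reduce a single-sparse-dim CSR matrix along dim=0.

    Mirrors the pseudocode in the C++ comment: sorted unique columns become
    the new col_indices, and a column map folds each value into its slot.
    """
    # Sorted unique columns, plus each input entry's slot in the output
    # (the analogue of at::_unique with sorted=True, return_inverse=True).
    new_col_indices = sorted(set(col_indices))
    slot = {c: i for i, c in enumerate(new_col_indices)}
    col_map = [slot[c] for c in col_indices]

    nnz = len(new_col_indices)
    new_crow_indices = [0, nnz]      # the result has a single row
    new_values = [identity] * nnz    # filled with rop's identity element
    for i, v in enumerate(values):
        new_values[col_map[i]] = rop(new_values[col_map[i]], v)
    return new_crow_indices, new_col_indices, new_values

# The example matrix from the comment, with rop = addition (identity 0):
crow, cols, vals = reduce_csr_dim0([0, 3, 2, 0, 2], [1, 2, 3, 4, 5],
                                   lambda a, b: a + b, 0)
print(crow, cols, vals)  # [0, 3] [0, 2, 3] [5, 8, 2]
```

With addition as rop, the output values [5, 8, 2] correspond to rop(1, 4), rop(3, 5), and the lone entry 2 from the worked example above.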