cpu_index_kernel Function — PyTorch Architecture
Architecture documentation for the cpu_index_kernel function template in IndexKernelUtils.h from the PyTorch codebase.
Entity Profile
Source Code
aten/src/ATen/native/cpu/IndexKernelUtils.h lines 52–83
template <typename scalar_t, typename func_t>
void cpu_index_kernel(TensorIteratorBase& iter, IntArrayRef index_size, IntArrayRef index_stride,
                      const func_t& f, bool serial_execution=false)
{
  int ntensor = iter.ntensors();
  // When launching the parallel version of the index kernel, use a grain size
  // smaller than internal::GRAIN_SIZE so that the available threads get a more
  // balanced workload and better cache locality. The value here was chosen via
  // op benchmarks to amortize the thread launch overhead.
  const int index_parallel_grain_size = 3000;
  auto loop = [&](char** data, const int64_t* strides, int64_t n) {
    auto indexer = Indexer(ntensor - 2, &data[2], &strides[2], index_size, index_stride);
    char* dst = data[0];
    char* src = data[1];
    if (is_constant_index(ntensor, strides)) {
      // specialization for when every element uses the same index
      int64_t offset = indexer.get(0);
      for (const auto i : c10::irange(n)) {
        f(dst + strides[0] * i, src + strides[1] * i, offset);
      }
    } else {
      for (const auto i : c10::irange(n)) {
        int64_t offset = indexer.get(i);
        f(dst + strides[0] * i, src + strides[1] * i, offset);
      }
    }
  };
  if (serial_execution) {
    iter.serial_for_each(loop, {0, iter.numel()});
  } else {
    iter.for_each(loop, index_parallel_grain_size);
  }
}
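For context, CPU index ops instantiate cpu_index_kernel with a per-element lambda that applies the computed byte offset to the source pointer. The sketch below shows what such a call site might look like; the index_gather_kernel wrapper name is hypothetical, and the copy lambda is illustrative rather than the exact upstream code.

#include <ATen/Dispatch.h>
#include <ATen/native/cpu/IndexKernelUtils.h>

// Hypothetical wrapper illustrating a call site for cpu_index_kernel.
static void index_gather_kernel(at::TensorIteratorBase& iter,
                                at::IntArrayRef index_size,
                                at::IntArrayRef index_stride) {
  AT_DISPATCH_ALL_TYPES(iter.dtype(), "index_gather_cpu", [&] {
    // Copy the source element found `offset` bytes past `src` into `dst`;
    // `offset` already combines the contributions of every index tensor.
    at::native::cpu_index_kernel<scalar_t>(
        iter, index_size, index_stride,
        [](char* dst, char* src, int64_t offset) {
          *reinterpret_cast<scalar_t*>(dst) =
              *reinterpret_cast<scalar_t*>(src + offset);
        });
  });
}

Because the lambda only receives raw char* pointers plus a byte offset, the same kernel skeleton can serve gather-style and put-style ops; only the lambda body changes.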
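The offset passed to that lambda comes from Indexer, defined earlier in the same header. It folds the values of the ntensor - 2 index tensors into a single byte offset along the indexed dimensions. Below is a simplified sketch of that computation, assuming the member names from the header and abbreviating the bounds check that the real code performs.

// Simplified sketch of Indexer::get (the real code also bounds-checks
// `value` against the dimension size before wrapping it).
int64_t get(int64_t idx) {
  int64_t offset = 0;
  for (const auto j : c10::irange(num_indexers)) {
    // Read the j-th index tensor's value for element `idx`.
    int64_t value = *reinterpret_cast<int64_t*>(&indexers[j][idx * indexer_strides[j]]);
    // Negative indices wrap around, Python-style.
    if (value < 0) {
      value += original_sizes[j];
    }
    // Accumulate the byte-offset contribution of this indexed dimension.
    offset += value * original_strides[j];
  }
  return offset;
}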