LogSoftMax Class — pytorch Architecture
Architecture documentation for LogSoftMax, the boolean template parameter of cpu_sparse_coo_softmax in SoftMax.cpp from the PyTorch codebase.
Source Code
aten/src/ATen/native/sparse/SoftMax.cpp lines 158–391
template <typename scalar_t, bool LogSoftMax>
void cpu_sparse_coo_softmax(Tensor output, const Tensor& input, const int64_t dim) {
/*
See test/test_sparse.py:test_softmax:sparse_softmax for the Python
prototype of the sparse softmax algorithm that this implementation
is based on.
Derivation of the sparse softmax algorithm with an example
----------------------------------------------------------
Consider the following 2-D sparse tensor with 0-D dense part as an
example, denote it by X:
  11 ** ** 14 15
  ** 22 ** 24 **
where `**` represents unspecified entries. The COO sparse tensor
representation of X is:
  indices = [[0, 1, 0, 1, 0],
             [0, 1, 3, 3, 4]]
  values = [11, 22, 14, 24, 15]
which after coalescing becomes
  indices = [[0, 0, 0, 1, 1],
             [0, 3, 4, 1, 3]]
  values = [11, 14, 15, 22, 24]
The softmax of X along the given dimension d is defined as
  S_d[i, j] = exp(X[i, j]) / sum(exp(X[I_d[k]]), k=0..X.shape[d]-1)
where the index tuple I_d[k] is defined as
  I_0[k] = k, j
  I_1[k] = i, k
For sparse tensors, the unspecified entries are skipped in the
softmax sum of exponents so that the result will be a sparse tensor
with the same indices as the input. Mathematically, this
corresponds to the case where the unspecified entries are
interpreted as negative infinities rather than zeros.
To minimize the defects from numerical evaluation of exponents
with very large or small arguments, the softmax implementation
uses the following numerically stable definition:
  S_d[i, j] = exp(X[i, j] - maxX_d) / sum(exp(X[I_d[k]] - maxX_d), k=0..X.shape[d]-1)
where
  maxX_d = max(X[I_d[k]], k=0..X.shape[d]-1)
is the maximum tensor along the direction d (it has dimensionality
`maxX_d.ndim = X.ndim - 1`).
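The stable definition above can be sketched for a single slice as follows. This is an illustrative sketch that is not taken from the PyTorch sources: the function name `stable_softmax` is hypothetical, and unspecified entries are assumed to be passed in as negative infinity, matching the interpretation described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax over one slice (hypothetical helper).
// Unspecified entries are given as -infinity: exp(-inf) == 0, so they add
// nothing to the sum of exponents and come out as 0 in the result.
std::vector<double> stable_softmax(const std::vector<double>& x) {
  assert(!x.empty());
  const double mx = *std::max_element(x.begin(), x.end());
  // Assumption: the slice has at least one specified (finite) entry.
  assert(std::isfinite(mx));
  std::vector<double> out(x.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = std::exp(x[i] - mx);  // argument is <= 0, so exp cannot overflow
    sum += out[i];
  }
  for (double& v : out) {
    v /= sum;
  }
  return out;
}
```

Applied to row 0 of the example X, that is [11, -inf, -inf, 14, 15], the shift by the maximum 15 turns the exponents into exp(-4), exp(-inf) = 0, 0, exp(-1) and exp(0) = 1.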
For the example sparse tensor X, we have:
  S_0._indices() == S_1._indices() == X._indices()
  maxX_0 = [11, 22, -inf, 24, 15]
  maxX_1 = [15, 24]
  S_0._values() = [exp(11 - maxX_0[0]) / exp(11 - maxX_0[0]),
                   exp(14 - maxX_0[3]) / (exp(14 - maxX_0[3]) + exp(24 - maxX_0[3])),
                   exp(15 - maxX_0[4]) / exp(15 - maxX_0[4]),
                   exp(22 - maxX_0[1]) / exp(22 - maxX_0[1]),
                   exp(24 - maxX_0[3]) / (exp(14 - maxX_0[3]) + exp(24 - maxX_0[3]))]
                = [1, exp(-10)/(exp(-10) + 1), 1, 1, 1/(exp(-10) + 1)]
(note that `maxX_0[2] == -inf` is not used to obtain S_0)
  S_1._values() = [exp(11 - maxX_1[0]) / (exp(11 - maxX_1[0]) + exp(14 - maxX_1[0]) + exp(15 - maxX_1[0])),
                   exp(14 - maxX_1[0]) / (exp(11 - maxX_1[0]) + exp(14 - maxX_1[0]) + exp(15 - maxX_1[0])),
                   exp(15 - maxX_1[0]) / (exp(11 - maxX_1[0]) + exp(14 - maxX_1[0]) + exp(15 - maxX_1[0])),
                   exp(22 - maxX_1[1]) / (exp(22 - maxX_1[1]) + exp(24 - maxX_1[1])),
                   exp(24 - maxX_1[1]) / (exp(22 - maxX_1[1]) + exp(24 - maxX_1[1]))]
                = [exp(-4) / (exp(-4) + exp(-1) + 1),
                   exp(-1) / (exp(-4) + exp(-1) + 1),
                   1 / (exp(-4) + exp(-1) + 1),
                   exp(-2) / (exp(-2) + 1),
                   1 / (exp(-2) + 1)]
To obtain the above via the for-loop over
`nnz(=len(X._values()))`, we introduce the indices mapping `pool`
as follows:
  indices = X._indices()
  for i in range(nnz):
      for j in range(nnz):
          if all(indices[k, i] == indices[k, j] for k in range(len(indices)) if k != d):
              assert pool_d[i] == pool_d[j]
          else:
              assert pool_d[i] != pool_d[j]
that is, the entries with values indices i and j are in the same
pool iff their locations in the grid of tensor indices align with
the direction along which the softmax is calculated. The `pool`
mapping maps the X._values() indices to the corresponding pool
index.
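One way to build such a mapping is sketched below. It is a sketch under stated assumptions rather than the `get_pools` helper used by the implementation: the name `pool_ids` is hypothetical, the indices are assumed to be coalesced and stored as one vector per sparse dimension, and pools are numbered in order of first appearance.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical helper: assign a pool id to every specified entry.
// Two entries share a pool when their indices agree in every sparse
// dimension except d, the dimension along which softmax is taken.
std::vector<std::int64_t> pool_ids(
    const std::vector<std::vector<std::int64_t>>& indices, std::size_t d) {
  assert(d < indices.size());
  const std::size_t nnz = indices[0].size();
  std::map<std::vector<std::int64_t>, std::int64_t> seen;  // key -> pool id
  std::vector<std::int64_t> pool(nnz);
  for (std::size_t i = 0; i < nnz; ++i) {
    // The key is the index tuple of entry i with dimension d dropped.
    std::vector<std::int64_t> key;
    for (std::size_t k = 0; k < indices.size(); ++k) {
      if (k != d) {
        key.push_back(indices[k][i]);
      }
    }
    // A new key gets the next free id; a known key keeps its id.
    const auto next_id = static_cast<std::int64_t>(seen.size());
    pool[i] = seen.emplace(key, next_id).first->second;
  }
  return pool;
}
```

For the coalesced example indices, `pool_ids(indices, 0)` groups entries by column and `pool_ids(indices, 1)` groups them by row.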
To save memory and processor resources, we pre-compute the entries
of maxX tensor and the sums of exponents as follows:
  mx_d = [max(values[i] for i in range(nnz) if pool_d[i] == k) for k in range(max(pool_d) + 1)]
  exp_sum_d = [sum(exp(values[i] - mx_d[k]) for i in range(nnz) if pool_d[i] == k) for k in range(max(pool_d) + 1)]
For example, if
  pool_0 = [0, 1, 2, 3, 1]
  pool_1 = [0, 0, 0, 1, 1]
then
  mx_0 = [11, 24, 15, 22]
  mx_1 = [15, 24]
  exp_sum_0 = [1, (exp(-10) + 1), 1, 1]
  exp_sum_1 = [(exp(-4) + exp(-1) + 1), (exp(-2) + 1)]
and
  S_0._values() = [exp(11 - mx_0[pool_0[0]]) / exp_sum_0[pool_0[0]],
                   exp(14 - mx_0[pool_0[1]]) / exp_sum_0[pool_0[1]],
                   exp(15 - mx_0[pool_0[2]]) / exp_sum_0[pool_0[2]],
                   exp(22 - mx_0[pool_0[3]]) / exp_sum_0[pool_0[3]],
                   exp(24 - mx_0[pool_0[4]]) / exp_sum_0[pool_0[4]]]
or in general,
  S_d._values() = [exp(values[i] - mx_d[pool_d[i]]) / exp_sum_d[pool_d[i]] for i in range(nnz)]
The above algorithm extends naturally to sparse tensors with a
non-scalar dense part, where all scalar operations become
element-wise tensor operations.
The implementation below adds further optimizations: it collects
pool indices to enable concurrency, minimizes the number of calls
to exp, and reuses the softmax implementation for log_softmax.
*/
  using accscalar_t = at::acc_type<scalar_t, false>;
  auto sparse_dim = input.sparse_dim();
  auto indices = input._indices().contiguous();
  auto values = input._values().contiguous();
  auto out_values = output._values();
  auto out_indices = output._indices();
  out_values.resize_as_(values);
  out_indices.resize_as_(indices);
  out_indices.copy_(indices);
  if (dim >= sparse_dim) {
    if (LogSoftMax) {
      auto new_values =
          at::cpu::_log_softmax(values, dim - sparse_dim + 1, false);
      out_values.set_(new_values);
    } else {
      auto new_values = at::cpu::_softmax(values, dim - sparse_dim + 1, false);
      out_values.set_(new_values);
    }
    return;
  }
  auto nnz = values.size(0);
  auto sizes = input.sizes();
  auto nvalues = get_nvalues(sizes, sparse_dim);
  /* Prepare accessors */
  auto values_2 = values.view({nnz, nvalues});
  auto values_accessor = values_2.accessor<scalar_t, 2>();
  auto out_values_2 = out_values.view({nnz, nvalues});
  auto out_values_accessor = out_values_2.accessor<scalar_t, 2>();
  /* Compute independent pools of indices */
  auto pools = get_pools(indices, sizes, dim);
  int64_t grain_size = 1;
  parallel_for(0, pools.size(), grain_size, [&](int64_t begin, int64_t end) {
    for (const auto p : c10::irange(begin, end)) {
      auto pool_indices = pools[p];
      // Skip empty pools
      if (pool_indices.empty())
        continue;
      /* Prepare scratch space */
      std::vector<accscalar_t> mx_row(nvalues, -std::numeric_limits<accscalar_t>::infinity());
      std::vector<accscalar_t> exp_sums_row(nvalues, 0);
      /* Compute mx */
      for (int64_t i : pool_indices) {
        auto values_row = values_accessor[i];
        for (const auto j : c10::irange(nvalues)) {
          mx_row[j] = std::max(mx_row[j], accscalar_t(values_row[j]));
        }
      }
      /* Apply exp to (v - mx) and sum the results */
      for (int64_t i : pool_indices) {
        auto values_row = values_accessor[i];
        auto out_values_row = out_values_accessor[i];
        for (const auto j : c10::irange(nvalues)) {
          auto v = std::exp(values_row[j] - mx_row[j]);
          if (!LogSoftMax) {
            out_values_row[j] = v;
          }
          exp_sums_row[j] += v;
        }
      }
      for (const auto j : c10::irange(nvalues)) {
        if (LogSoftMax) {
          mx_row[j] += std::log(exp_sums_row[j]);
        } else {
          exp_sums_row[j] = 1.0 / exp_sums_row[j];
        }
      }
      /* Normalize with the sum of exponents */
      for (int64_t i : pool_indices) {
        auto values_row = values_accessor[i];
        auto out_values_row = out_values_accessor[i];
        for (const auto j : c10::irange(nvalues)) {
          if (LogSoftMax) {
            out_values_row[j] = values_row[j] - mx_row[j];
          } else {
            out_values_row[j] *= exp_sums_row[j];
          }
        }
      }
    }
  });
}
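In the LogSoftMax branch, the function folds the log of the sum of exponents into the per-pool maximum and then subtracts that single quantity from every value, so no exponential is ever written to the output. A minimal standalone sketch of this identity for one pool with a scalar dense part is shown below. The function name `pool_log_softmax` is hypothetical and the code is an illustration, not part of SoftMax.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Log-softmax of one pool (hypothetical helper), computed as
//   v - (mx + log(sum(exp(v - mx))))
// which equals log(exp(v) / sum(exp(v))) but avoids overflow in exp.
std::vector<double> pool_log_softmax(const std::vector<double>& v) {
  assert(!v.empty());
  const double mx = *std::max_element(v.begin(), v.end());
  double exp_sum = 0.0;
  for (double x : v) {
    exp_sum += std::exp(x - mx);
  }
  // Same role as `mx_row[j] += std::log(exp_sums_row[j])` in the listing.
  const double log_norm = mx + std::log(exp_sum);
  std::vector<double> out;
  out.reserve(v.size());
  for (double x : v) {
    out.push_back(x - log_norm);
  }
  return out;
}
```

Exponentiating the result recovers the softmax values, so for the pool [22, 24] from the example the two outputs exponentiate to exp(-2) / (exp(-2) + 1) and 1 / (exp(-2) + 1).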