LogSoftMax Class — pytorch Architecture

Architecture documentation for the LogSoftMax class in SoftMax.cpp from the pytorch codebase.

Source Code

aten/src/ATen/native/sparse/SoftMax.cpp lines 158–391

template <typename scalar_t, bool LogSoftMax>
void cpu_sparse_coo_softmax(Tensor output, const Tensor& input, const int64_t dim) {
  /*
    See test/test_sparse.py:test_softmax:sparse_softmax for the Python
    prototype of the sparse softmax algorithm that this implementation
    is based on.

    Derivation of the sparse softmax algorithm with an example
    ----------------------------------------------------------

    Consider the following 2-D sparse tensor with 0-D dense part as an
    example, denote it by X:

      11 ** ** 14 15
      ** 22 ** 24 **

    where `**` represents unspecified entries. The COO sparse tensor
    representation of X is:

      indices = [[0, 1, 0, 1, 0],
                 [0, 1, 3, 3, 4]]
      values = [11, 22, 14, 24, 15]

    that after coalescing becomes

      indices = [[0, 0, 0, 1, 1],
                 [0, 3, 4, 1, 3]]
      values = [11, 14, 15, 22, 24]

    The softmax of X along the given dimension d is defined as

      S_d[i, j] = exp(X[i, j]) / sum(exp(X[I_d[k]]), k=0..X.shape[d]-1)

    where the index tuple I_d[k] is defined as

      I_0[k] = k, j
      I_1[k] = i, k

    For sparse tensors, the unspecified entries are skipped in the
    softmax sum of exponents so that the result will be a sparse tensor
    with the same indices as the input. Mathematically, this
    corresponds to the case where the unspecified entries are
    interpreted as negative infinities rather than zeros.

    To minimize the defects from numerical evaluation of exponents
    with very large or small arguments, the softmax implementation
    uses the following numerically stable definition:

      S_d[i, j] = exp(X[i, j] - maxX_d) / sum(exp(X[I_d[k]] - maxX_d), k=0...X.shape[d]-1)

    where

      maxX_d = max(X[I_d[k]], k=0...X.shape[d]-1)

    is the maximum tensor along the direction d (it has dimensionality
    `maxX_d.ndim = X.ndim - 1`).

    For the example sparse tensor X, we have:

      S_0._indices() == S_1._indices() == X._indices()

      maxX_0 = [11, 22, -inf, 24, 15]
      maxX_1 = [15, 24]

      S_0._values() = [exp(11 - maxX_0[0]) / exp(11 - maxX_0[0]),
                       exp(14 - maxX_0[3]) / (exp(14 - maxX_0[3]) + exp(24 - maxX_0[3])),
                       exp(15 - maxX_0[4]) / exp(15 - maxX_0[4]),
                       exp(22 - maxX_0[1]) / exp(22 - maxX_0[1]),
                       exp(24 - maxX_0[3]) / (exp(14 - maxX_0[3]) + exp(24 - maxX_0[3]))]
                    = [1, exp(-10)/(exp(-10) + 1), 1, 1, 1/(exp(-10) + 1)]

      (note that `maxX_0[2] == -inf` is not used to obtain S_0)

      S_1._values() = [exp(11 - maxX_1[0]) / (exp(11 - maxX_1[0]) + exp(14 - maxX_1[0]) + exp(15 - maxX_1[0])),
                       exp(14 - maxX_1[0]) / (exp(11 - maxX_1[0]) + exp(14 - maxX_1[0]) + exp(15 - maxX_1[0])),
                       exp(15 - maxX_1[0]) / (exp(11 - maxX_1[0]) + exp(14 - maxX_1[0]) + exp(15 - maxX_1[0])),
                       exp(22 - maxX_1[1]) / (exp(22 - maxX_1[1]) + exp(24 - maxX_1[1])),
                       exp(24 - maxX_1[1]) / (exp(22 - maxX_1[1]) + exp(24 - maxX_1[1]))]
                    = [exp(-4) / (exp(-4) + exp(-1) + 1),
                       exp(-1) / (exp(-4) + exp(-1) + 1),
                       1 / (exp(-4) + exp(-1) + 1),
                       exp(-2) / (exp(-2) + 1),
                       1 / (exp(-2) + 1)]

    To obtain the above via the for-loop over
    `nnz(=len(X._values()))`, we introduce the indices mapping `pool`
    as follows:

      indices = X._indices()
      for i in range(nnz):
          for j in range(nnz):
              if all(indices[m, i] == indices[m, j] for m in range(sparse_dim) if m != d):
                  assert pool_d[i] == pool_d[j]
              else:
                  assert pool_d[i] != pool_d[j]

    that is, the entries with values indices i and j are in the same
    pool iff their locations in the grid of tensor indices align with
    the direction along which the softmax is calculated. The `pool`
    mapping maps the X._values() indices to the corresponding pool
    index.

    To save memory and processor resources, we pre-compute the entries
    of maxX tensor and the sums of exponents as follows:

      mx_d = [max(values[i] for i in range(nnz) if pool_d[i] == k) for k in range(max(pool_d) + 1)]
      exp_sum_d = [sum(exp(values[i] - mx_d[k]) for i in range(nnz) if pool_d[i] == k) for k in range(max(pool_d) + 1)]

    For example, if

      pool_0 = [0, 1, 2, 3, 1]
      pool_1 = [0, 0, 0, 1, 1]

    then

      mx_0 = [11, 24, 15, 22]
      mx_1 = [15, 24]
      exp_sum_0 = [1, (exp(-10) + 1), 1, 1]
      exp_sum_1 = [(exp(-4) + exp(-1) + 1), (exp(-2) + 1)]

    and

      S_0._values() = [exp(11 - mx_0[pool_0[0]]) / exp_sum_0[pool_0[0]],
                       exp(14 - mx_0[pool_0[1]]) / exp_sum_0[pool_0[1]],
                       exp(15 - mx_0[pool_0[2]]) / exp_sum_0[pool_0[2]],
                       exp(22 - mx_0[pool_0[3]]) / exp_sum_0[pool_0[3]],
                       exp(24 - mx_0[pool_0[4]]) / exp_sum_0[pool_0[4]]]

    or in general,

      S_d._values() = [exp(values[i] - mx_d[pool_d[i]]) / exp_sum_d[pool_d[i]] for i in range(nnz)]

    The above algorithm can be easily extended for cases with
    non-scalar dense part of the sparse tensor where all scalar
    operations become element-wise tensor operations.

    The implementation below adds further optimizations: it collects
    the pool indices up front to enable concurrency, minimizes the
    number of calls to exp, and reuses the softmax implementation for
    log_softmax.
  */
  using accscalar_t = at::acc_type<scalar_t, false>;
  auto sparse_dim = input.sparse_dim();
  auto indices = input._indices().contiguous();
  auto values = input._values().contiguous();
  auto out_values = output._values();
  auto out_indices = output._indices();
  out_values.resize_as_(values);
  out_indices.resize_as_(indices);
  out_indices.copy_(indices);

  if (dim >= sparse_dim) {
    if (LogSoftMax) {
      auto new_values =
          at::cpu::_log_softmax(values, dim - sparse_dim + 1, false);
      out_values.set_(new_values);
    } else {
      auto new_values = at::cpu::_softmax(values, dim - sparse_dim + 1, false);
      out_values.set_(new_values);
    }
    return;
  }

  auto nnz = values.size(0);
  auto sizes = input.sizes();
  auto nvalues = get_nvalues(sizes, sparse_dim);

  /* Prepare accessors */
  auto values_2 = values.view({nnz, nvalues});
  auto values_accessor = values_2.accessor<scalar_t, 2>();

  auto out_values_2 = out_values.view({nnz, nvalues});
  auto out_values_accessor = out_values_2.accessor<scalar_t, 2>();

  /* Compute independent pools of indices */
  auto pools = get_pools(indices, sizes, dim);

  int64_t grain_size = 1;
  parallel_for(0, pools.size(), grain_size, [&](int64_t begin, int64_t end) {
      for (const auto p : c10::irange(begin, end)) {
        auto pool_indices = pools[p];

        // Skip empty pools
        if (pool_indices.empty())
          continue;

        /* Prepare scratch space */
        std::vector<accscalar_t> mx_row(nvalues, -std::numeric_limits<accscalar_t>::infinity());
        std::vector<accscalar_t> exp_sums_row(nvalues, 0);

        /* Compute mx */
        for (int64_t i : pool_indices) {
          auto values_row = values_accessor[i];
          for (const auto j : c10::irange(nvalues)) {
            mx_row[j] = std::max(mx_row[j], accscalar_t(values_row[j]));
          }
        }

        /* Apply exp to (v - mx) and sum the results */
        for (int64_t i : pool_indices) {
          auto values_row = values_accessor[i];
          auto out_values_row = out_values_accessor[i];
          for (const auto j : c10::irange(nvalues)) {
            auto v = std::exp(values_row[j] - mx_row[j]);
            if (!LogSoftMax) {
              out_values_row[j] = v;
            }
            exp_sums_row[j] += v;
          }
        }

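        /* Fold the normalizer into the scratch rows: for log_softmax, mx
           becomes mx + log(sum); for softmax, the sum becomes its reciprocal */
        for (const auto j : c10::irange(nvalues)) {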
          if (LogSoftMax) {
            mx_row[j] += std::log(exp_sums_row[j]);
          } else {
            exp_sums_row[j] = 1.0 / exp_sums_row[j];
          }
        }

        /* Normalize with the sum of exponents */
        for (int64_t i : pool_indices) {
          auto values_row = values_accessor[i];
          auto out_values_row = out_values_accessor[i];
          for (const auto j : c10::irange(nvalues)) {
            if (LogSoftMax) {
              out_values_row[j] = values_row[j] - mx_row[j];
            } else {
              out_values_row[j] *= exp_sums_row[j];
            }
          }
        }
      }
    });
}
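
The pool-based algorithm described in the comment above can be illustrated without ATen. The following is a minimal standalone sketch, not the PyTorch implementation: it assumes a coalesced 2-D COO tensor with scalar values, runs serially instead of using parallel_for, and groups entries with a std::map. The function name sparse_softmax_2d and its parameters are hypothetical and were chosen only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <vector>

// Hypothetical helper (not part of ATen): softmax or log-softmax over the
// specified entries of a coalesced 2-D COO tensor with scalar values.
// rows[i] and cols[i] give the location of values[i]; dim must be 0 or 1.
std::vector<double> sparse_softmax_2d(
    const std::vector<std::int64_t>& rows,
    const std::vector<std::int64_t>& cols,
    const std::vector<double>& values,
    int dim,
    bool log_softmax) {
  const std::size_t nnz = values.size();
  assert(rows.size() == nnz && cols.size() == nnz);
  assert(dim == 0 || dim == 1);

  // Two entries share a pool when their index along the *other*
  // dimension matches, i.e. they lie on the same line along dim.
  const std::vector<std::int64_t>& key = (dim == 0) ? cols : rows;
  std::map<std::int64_t, std::vector<std::size_t>> pools;
  for (std::size_t i = 0; i < nnz; ++i) {
    pools[key[i]].push_back(i);
  }

  std::vector<double> out(nnz);
  for (const auto& kv : pools) {
    const std::vector<std::size_t>& pool = kv.second;

    // Per-pool maximum (mx_d in the comment above).
    double mx = -std::numeric_limits<double>::infinity();
    for (std::size_t i : pool) {
      mx = std::max(mx, values[i]);
    }

    // Per-pool sum of shifted exponents (exp_sum_d in the comment above).
    double exp_sum = 0.0;
    for (std::size_t i : pool) {
      exp_sum += std::exp(values[i] - mx);
    }

    // Normalize each entry with the statistics of its own pool.
    for (std::size_t i : pool) {
      out[i] = log_softmax ? values[i] - mx - std::log(exp_sum)
                           : std::exp(values[i] - mx) / exp_sum;
    }
  }
  return out;
}
```

Called with rows = {0, 0, 0, 1, 1}, cols = {0, 3, 4, 1, 3} and values = {11, 14, 15, 22, 24}, the sketch should reproduce the S_0 and S_1 values derived in the comment for dim = 0 and dim = 1. Unlike the real implementation, it recomputes exp in the normalization step rather than caching it in the output buffer, which keeps the sketch short at the cost of extra exp calls.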
