My Take on CrossEntropyLoss

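I spent a day reading explanations of the various kinds of entropy, and the better articles are now in my bookmarks. All formulas in this post come from the PyTorch documentation.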

Softmax

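Softmax rescales the input into the range $[0,1]$ so that the outputs sum to $1$. Every element is transformed and the dimensionality does not change: an N-dimensional input gives an N-dimensional output.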

$\operatorname{Softmax}\left(x_{i}\right)=\frac{\exp \left(x_{i}\right)}{\sum_{j} \exp \left(x_{j}\right)}$
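
The formula can be checked with a few lines of plain Python. This is only a minimal sketch: the function name `softmax` and the example values are made up for illustration, and the max-subtraction is a common safety step rather than part of the formula.

```python
import math

def softmax(xs):
    """Softmax of a list of floats, following the formula above."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.5, 0.1, 1.7, -0.3, 1.4]  # made-up example values
probs = softmax(logits)

print(probs)
print(len(probs) == len(logits))          # same number of elements in and out
print(all(0.0 < p < 1.0 for p in probs))  # every output is a valid probability
print(abs(sum(probs) - 1.0) < 1e-9)       # outputs sum to 1
```

The largest input gets the largest share of the probability mass, which is the property CrossEntropyLoss relies on later.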

LogSoftmax

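Computing the log of Softmax in one step is generally more numerically stable than computing Softmax and then taking the log separately. The output lies in $[-\infty, 0)$, and the dimensionality is again unchanged.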

$\log \operatorname{Softmax}\left(x_{i}\right)=\log \left(\frac{\exp \left(x_{i}\right)}{\sum_{j} \exp \left(x_{j}\right)}\right)$
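
The stability difference is easiest to see with an extreme input. The sketch below uses made-up logits with a very large gap and compares `torch.log(torch.softmax(...))` with `torch.log_softmax(...)`. It is an illustration under that assumption, not a statement about how PyTorch implements either function internally.

```python
import torch

# Made-up logits with a large gap between the two entries.
x = torch.tensor([0.0, 1000.0])

# Two-step version: softmax first, then log.
# The first probability underflows to 0 in float32, so its log is -inf.
two_step = torch.log(torch.softmax(x, dim=0))

# Fused version: returns the log-probabilities directly.
fused = torch.log_softmax(x, dim=0)

print(two_step)
print(fused)
```

The two-step result contains `-inf`, which would poison any loss or gradient built on it, while the fused result keeps the finite value `-1000`.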

NLLLoss

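Negative log likelihood loss. It computes the loss on inputs that have already been passed through a log, for example the output of LogSoftmax. For each sample, $l_{n}=-w_{y_n}x_{n,y_n}$. If no weight is passed, every $w$ is 1, so $l_{n}=-x_{n,y_n}$, that is, the negative of the input at the target class, and reduction='mean' reduces to dividing the sum by $N$.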

$\ell(x, y)=\left\{ \begin{array}{ll} {\{l_{1}, \dots, l_{N}\},} & {\text{if reduction}=\text{'none';}} \\ {\sum_{n=1}^{N} l_{n},} & {\text{if reduction}=\text {'sum';}} \\ {\frac{\sum_{n=1}^{N} l_{n}}{\sum_{m=1}^{N} w_{y_m}},} & {\text{if reduction}=\text{'mean'.}} \\ \end{array}\right.$
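
The three reductions can be compared against a hand computation. In the sketch below the scores, targets, and per-class weights are all made up. The only library calls are `torch.log_softmax` and `torch.nn.functional.nll_loss`.

```python
import torch
import torch.nn.functional as F

# Made-up scores for N=3 samples and C=4 classes, turned into log-probabilities.
scores = torch.tensor([[1.0, 2.0, 0.5, -1.0],
                       [0.3, -0.2, 1.5, 0.0],
                       [2.0, 0.1, 0.1, 0.4]])
log_probs = torch.log_softmax(scores, dim=1)
target = torch.tensor([1, 2, 0])
weight = torch.tensor([1.0, 2.0, 0.5, 1.0])  # hypothetical per-class weights

# Hand-computed l_n = -w_{y_n} * x_{n, y_n}
w = weight[target]
per_sample = -w * log_probs[torch.arange(3), target]

loss_none = F.nll_loss(log_probs, target, weight=weight, reduction="none")
loss_sum = F.nll_loss(log_probs, target, weight=weight, reduction="sum")
loss_mean = F.nll_loss(log_probs, target, weight=weight, reduction="mean")

print(torch.allclose(loss_none, per_sample))
print(torch.allclose(loss_sum, per_sample.sum()))
# With weights, 'mean' divides by the sum of the target weights, not by N.
print(torch.allclose(loss_mean, per_sample.sum() / w.sum()))
```

Without a `weight` argument every $w_{y_n}$ is 1, the denominator becomes $N$, and 'mean' is the ordinary average.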

CrossEntropyLoss

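Here are a few small experiments to supplement the CrossEntropyLoss page on the PyTorch website.
The documentation gives the following formula for CrossEntropyLoss (everything below is without weight):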

$\operatorname{loss}(x, class)=-\log \left(\frac{\exp (x[class])}{\sum_{j} \exp (x[j])}\right)$

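The larger $\exp (x[class])$ is, that is, the larger the share taken by the value of the target class, the smaller the loss and the closer the result is to 0.
A single sample is enough to simulate the process: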

>>> import numpy as np
>>> import torch
>>> import torch.nn as nn
>>> loss = nn.CrossEntropyLoss()
>>> input = torch.randn(1, 5, requires_grad=True)
>>> input
tensor([[-1.4976,  0.1278,  1.6863, -0.3295,  1.4173]], requires_grad=True)
>>> target = torch.empty(1, dtype=torch.long).random_(5)
>>> target
tensor([3])
>>> loss(input, target)
tensor(2.7809, grad_fn=<NllLossBackward>)
>>> i = input.detach().numpy()
>>> i = i[0]
>>> i
array([-1.4976473 ,  0.12782747,  1.6863296 , -0.32949358,  1.4173082 ],
      dtype=float32)
>>> np.log(np.sum(np.exp(i))/np.exp(i[3]))
2.7809231
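
The claim that a larger target value gives a smaller loss can also be checked directly. The sketch below reuses the rounded logits and the target from the session above. The helper `cross_entropy` and the bump sizes are made up for illustration.

```python
import math

def cross_entropy(x, cls):
    """-log(exp(x[cls]) / sum_j exp(x[j])) for one sample, without weights."""
    m = max(x)  # shift by the max for numerical safety; the result is unchanged
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x))
    return log_sum_exp - x[cls]

# Logits and target copied (rounded) from the session above.
x = [-1.4976, 0.1278, 1.6863, -0.3295, 1.4173]
cls = 3
base = cross_entropy(x, cls)
print(base)  # close to the 2.7809 reported by nn.CrossEntropyLoss

# Raise only the target logit and watch the loss shrink toward 0.
losses = []
for bump in [0.0, 2.0, 4.0, 8.0]:
    y = list(x)
    y[cls] += bump
    losses.append(cross_entropy(y, cls))
print(losses)
```

The loss stays positive but keeps decreasing as the target logit grows, which matches the reading of the formula given earlier.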

For a batch containing several samples, the documentation says: "The losses are averaged across observations for each minibatch."

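There is one more sentence: "This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class." So the formula above is simply the negative of LogSoftmax evaluated at the target class. The minus sign maps the range to $(0, +\infty]$; the closer the value is to 0 the better, so training only has to minimize the loss. Here is how the CrossEntropyLoss source combines the two: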

return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)

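The input first goes through log_softmax and then through nll_loss. The default is reduction='mean', which is why, as noted above, the result is the mean over the samples in a minibatch. The next experiment uses a batch of several samples and sets reduction to 'none' and then to 'mean'.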

>>> input = torch.randn(3, 5, requires_grad=True)
>>> target = torch.empty(3, dtype=torch.long).random_(5)
>>> loss = nn.CrossEntropyLoss(reduction='none')
>>> loss(input, target)
tensor([1.2951, 1.9074, 1.6085], grad_fn=<NllLossBackward>)
>>> i = input.detach().numpy()
>>> i
array([[ 1.7035984 , -0.8095179 , -0.22653757, -0.19294807,  1.0479178 ],
       [ 0.37498614, -0.26111123, -2.1978652 ,  0.80966765, -0.5037327 ],
       [ 0.87008125, -1.5282617 ,  0.25803488, -1.0827187 ,  2.0395918 ]],
      dtype=float32)
>>> i=i[0]
>>> i
array([ 1.7035984 , -0.8095179 , -0.22653757, -0.19294807,  1.0479178 ],
      dtype=float32)
>>> target
tensor([4, 1, 0])
>>> np.log(np.sum(np.exp(i))/np.exp(i[4]))
1.2950675
>>> i = input.detach().numpy()[1]
>>> np.log(np.sum(np.exp(i))/np.exp(i[1]))
1.9073898

This reproduces the values obtained with reduction='none'. Now set it to 'mean':

>>> loss_mean = nn.CrossEntropyLoss(reduction='mean')
>>> loss_mean(input, target)
tensor(1.6037, grad_fn=<NllLossBackward>)
>>> (1.2951 + 1.9074 + 1.6085)/3
1.6036666666666666
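
The same checks can be written as one repeatable script. This sketch assumes an arbitrary seed and random logits, so the numbers differ from the session above. It only verifies that `F.cross_entropy` agrees with `F.nll_loss` applied to `F.log_softmax`, and that 'mean' is the average of the per-sample losses.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)               # arbitrary seed, only for repeatability
logits = torch.randn(3, 5)         # 3 samples, 5 classes
target = torch.tensor([4, 1, 0])   # same targets as in the session above

# The fused loss and the explicit two-step version should agree.
for reduction in ("none", "mean"):
    fused = F.cross_entropy(logits, target, reduction=reduction)
    two_step = F.nll_loss(F.log_softmax(logits, dim=1), target, reduction=reduction)
    print(reduction, torch.allclose(fused, two_step))

# Without weights, 'mean' is the plain average over the samples.
per_sample = F.cross_entropy(logits, target, reduction="none")
mean_loss = F.cross_entropy(logits, target)  # reduction defaults to 'mean'
print(torch.allclose(mean_loss, per_sample.mean()))
```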

The Aha Moment

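One question had bothered me for a long time: the cross-entropy formula is clearly $H(p, q)=-\sum_{x} p(x) \log q(x)$, so why does $p(x)$ vanish in CrossEntropyLoss? The article 《经典损失函数：交叉熵》 ("Classic Loss Functions: Cross-Entropy") is very well written. It explains why cross-entropy is used together with Softmax, and it also showed me where $p(x)$ went.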

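I had gone straight to the PyTorch source, which misled me a little. As that article explains, CrossEntropyLoss conceptually computes Softmax first, turning the raw scores into a probability distribution $q(x)$, and then computes the cross-entropy. In that step $p(x)$ is the one-hot encoding of the label: only the entry for the true class is 1 and all the others are 0. The sum therefore collapses to $H(p, q)=- \log q(x_{class})$, with the summation sign and $p(x)$ gone, and substituting the Softmax formula for $q(x_{class})$ gives exactly the CrossEntropyLoss formula in the PyTorch documentation. PyTorch does not organize the computation this way, though: it uses LogSoftmax + NLLLoss, which computes the same thing.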
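Note also that the $1/N$ and the $\sum$ in NLLLoss run over different samples and must not be confused with the summation in the cross-entropy formula, which runs over classes.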
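
The collapse of the sum under a one-hot $p(x)$ can be verified numerically. The sketch below uses the rounded first row and its target from the multi-sample session. The helper names `softmax` and `full_cross_entropy` are made up for illustration.

```python
import math

def softmax(x):
    m = max(x)  # shift by the max for numerical safety
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def full_cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x), summed over all classes."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# First row of the multi-sample session (rounded) and its target class.
x = [1.7036, -0.8095, -0.2265, -0.1929, 1.0479]
cls = 4

q = softmax(x)                                          # predicted distribution
p = [1.0 if j == cls else 0.0 for j in range(len(x))]   # one-hot label

full = full_cross_entropy(p, q)   # sum over every class
collapsed = -math.log(q[cls])     # only the target-class term survives

print(full, collapsed)  # both close to the 1.2951 from the session
```

Every term with $p(x)=0$ contributes nothing, so the full sum and the single surviving term are the same number.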