Convolutional Layers and Positional Encoding

In NLP, classic models such as TextCNN use no positional encoding. So where does a CNN's position information come from?

How much position information does a CNN contain?

The ICLR 2020 paper "How Much Position Information Do Convolutional Neural Networks Encode?" gives us an answer.

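The authors designed a Position Encoding Network (PosENet) to probe position information.

Specifically, features are extracted from five layers of VGG, interpolated to the same size and concatenated; a few convolutional layers then process the result and map it to a specified ground truth (gt).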
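
The fusion step can be sketched in a few lines. This is only a minimal illustration in plain Python, not the authors' code: the feature maps are hypothetical constant stand-ins rather than real VGG activations, nearest-neighbour resizing replaces the interpolation used in the paper, and the convolutional layers that follow the concatenation are left out.

```python
def resize_nearest(fmap, out_h, out_w):
    """Resize a feature map stored as fmap[c][y][x] with nearest-neighbour sampling."""
    in_h, in_w = len(fmap[0]), len(fmap[0][0])
    return [
        [
            [channel[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)
        ]
        for channel in fmap
    ]


def fuse_features(fmaps, out_h, out_w):
    """Bring every feature map to (out_h, out_w) and concatenate along channels."""
    fused = []
    for fmap in fmaps:
        fused.extend(resize_nearest(fmap, out_h, out_w))
    return fused


def constant_map(channels, size, value):
    """Hypothetical stand-in for a backbone feature map: channels x size x size."""
    return [[[value] * size for _ in range(size)] for _ in range(channels)]


# Five stand-in maps whose resolution shrinks from stage to stage (assumed sizes).
shapes = [(2, 16), (3, 8), (4, 4), (4, 2), (4, 1)]
stages = [constant_map(c, s, float(i)) for i, (c, s) in enumerate(shapes)]

fused = fuse_features(stages, 8, 8)
print(len(fused), len(fused[0]), len(fused[0][0]))  # 17 8 8
```

The 17 output channels are simply the channel counts of the five stages added up; a real PosENet would feed this stacked tensor into its convolutional layers.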

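The gt is shown in the figure on the right. It is specified by hand and comes in five kinds. Its design rests on the following assumption: if a CNN did not encode position information, it could not learn such patterns from image content alone.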
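
As far as I recall from the paper, the five targets are a horizontal gradient, a vertical gradient, a Gaussian, and horizontal and vertical stripes. The sketch below generates such maps. The function names and exact parameters (normalization to [0, 1], `sigma`, `period`) are my own assumptions. The point to notice is that none of these functions takes an image as input, so the target is the same for every picture.

```python
import math


def horizontal_gradient(h, w):
    """Grows linearly from 0 (left) to 1 (right); depends only on x."""
    return [[x / (w - 1) for x in range(w)] for _ in range(h)]


def vertical_gradient(h, w):
    """Grows linearly from 0 (top) to 1 (bottom); depends only on y."""
    return [[y / (h - 1)] * w for y in range(h)]


def gaussian(h, w, sigma=0.25):
    """Peak of 1 at the image centre; sigma is relative to the image size (assumed)."""
    cy, cx = (h - 1) / 2, (w - 1) / 2
    return [
        [
            math.exp(-(((y - cy) / h) ** 2 + ((x - cx) / w) ** 2) / (2 * sigma**2))
            for x in range(w)
        ]
        for y in range(h)
    ]


def stripes(h, w, period=2, along="x"):
    """Binary stripes that alternate every `period` pixels along one axis."""
    if along == "x":
        return [[float((x // period) % 2) for x in range(w)] for _ in range(h)]
    return [[float((y // period) % 2)] * w for y in range(h)]


print(horizontal_gradient(2, 5)[0])  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(stripes(1, 4, period=1, along="x")[0])  # [0.0, 1.0, 0.0, 1.0]
```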

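The authors trained under several settings. PosENet on its own, without the pretrained backbone features, can hardly learn the position information. This supports the view that the position information comes from the CNN features, i.e. that CNNs can learn position information.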

Factors related to position information

Analyzing further, the authors reached the following conclusions.

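The more layers, the better position information is learned. On the one hand, stacking convolutions enlarges the effective receptive field: for example, two stacked 3 × 3 convolutional layers are spatially equivalent to one 5 × 5 convolutional layer (VGG paper). Another possibility is that position information may be represented in a way that requires more than first-order inference.

The larger the kernel, the better position information is learned. Position information may be spread spatially within a layer and across feature space, so a larger receptive field helps resolve it better.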
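
Both observations come down to receptive field, which is easy to compute for a plain stack of convolutions. The helper below is a small sketch (the name `receptive_field` is mine) using the standard recurrence: each layer adds `(kernel_size - 1)` times the current spacing between neighbouring outputs.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of a stack of convolutional layers.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


print(receptive_field([(3, 1), (3, 1)]))  # 5, same as a single 5 x 5 layer
print(receptive_field([(5, 1)]))          # 5
print(receptive_field([(3, 1)] * 3))      # 7
print(receptive_field([(7, 1)]))          # 7
```

Adding layers or enlarging kernels both let a given output position see further, and in particular see more of the image border.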

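The source of position information

The authors ran further experiments and reached a surprising conclusion: it is zero padding that drives the learning of position information!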
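
A toy 1D example makes the intuition visible. This is my own illustration, not an experiment from the paper: a constant signal carries no cue about position, yet once zeros are padded at the borders, the convolution output differs near the edges.

```python
def conv1d(signal, kernel, pad):
    """'Valid' 1D cross-correlation after adding `pad` zeros on each side."""
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    k = len(kernel)
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(padded) - k + 1)
    ]


signal = [1.0] * 6  # constant input: the content says nothing about position
kernel = [1.0, 1.0, 1.0]

print(conv1d(signal, kernel, pad=0))  # [3.0, 3.0, 3.0, 3.0]
print(conv1d(signal, kernel, pad=1))  # [2.0, 3.0, 3.0, 3.0, 3.0, 2.0]

# A second padded layer carries the border cue one step further inward.
once = conv1d(signal, kernel, pad=1)
print(conv1d(once, kernel, pad=1))    # [5.0, 8.0, 9.0, 9.0, 8.0, 5.0]
```

Without padding every output is identical, so position cannot be recovered. With zero padding, the distance to the border leaks into the activations, and stacking layers spreads that cue toward the centre.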

Can a CNN use positional encoding?

Since a CNN can already learn position information, does adding an explicit positional encoding still help?

Facebook's "Convolutional Sequence to Sequence Learning" is exactly such a work: a convolutional neural network that uses positional encoding.

The architecture is as follows:

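The input embedding uses absolute positional encoding. That is, given the input $x=(x_1,x_2,…)$ and the position embeddings $p=(p_1,p_2,…)$, the resulting representation is $e=(x_1+p_1,x_2+p_2,…)$.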
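
In code this is just an element-wise sum of two lookup tables. The tables below are hypothetical toy values chosen for illustration; in a real model both would be learned.

```python
# Hypothetical 2-dimensional embedding tables (toy values, not from any model).
token_embedding = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
position_embedding = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]


def embed(tokens):
    """Return e = (x_1 + p_1, x_2 + p_2, ...) for a token sequence."""
    return [
        [x + p for x, p in zip(token_embedding[tok], position_embedding[i])]
        for i, tok in enumerate(tokens)
    ]


e = embed(["a", "b", "a"])
print(e[0] != e[2])  # True: the same token "a" gets a different vector per position
```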

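Another key ingredient is the use of attention.

The attention query is $d_i=W_dh_i+b_d+g_i$, where $h_i$ is the current decoder state and $g_i$ is the embedding of the previous target element.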

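In effect this is also a form of QKV attention, with Q, K and V as shown in the figure. In other words, the embedding e also takes part in the attention.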
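
Written out, the attention looks as follows. This is my recollection of the ConvS2S paper with the layer indices dropped, so treat the notation as an assumption: $z_j$ is the encoder output at source position $j$, $e_j$ is the input embedding at that position, and $m$ is the source length.

$$
a_{ij} = \frac{\exp(d_i \cdot z_j)}{\sum_{t=1}^{m} \exp(d_i \cdot z_t)}, \qquad
c_i = \sum_{j=1}^{m} a_{ij}\,(z_j + e_j)
$$

Read as QKV attention, the query is $d_i$, the keys are $z_j$, and the values are $z_j + e_j$. Since $e_j = x_j + p_j$, the positional encoding enters the context vector $c_i$ through the values.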

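Code (the AttentionLayer from fairseq's fconv model):

import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from fairseq.modules import BeamableMM

# Note: Linear is the weight-normalized linear helper defined next to this class in fairseq.


class AttentionLayer(nn.Module):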
    def __init__(self, conv_channels, embed_dim, bmm=None):
        super().__init__()
        # projects from output of convolution to embedding dimension
        self.in_projection = Linear(conv_channels, embed_dim)
        # projects from embedding dimension to convolution size
        self.out_projection = Linear(embed_dim, conv_channels)

        self.bmm = bmm if bmm is not None else torch.bmm

    def forward(self, x, target_embedding, encoder_out, encoder_padding_mask):
        residual = x

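        # attention: the query adds the target embedding to the projected decoder state
        x = (self.in_projection(x) + target_embedding) * math.sqrt(0.5)
        # encoder_out[0] holds the encoder outputs, used as keys
        x = self.bmm(x, encoder_out[0])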

        # don't attend over padding
        if encoder_padding_mask is not None:
            x = (
                x.float()
                .masked_fill(encoder_padding_mask.unsqueeze(1), float("-inf"))
                .type_as(x)
            )  # FP16 support: cast to float and back

        # softmax over last dim
        sz = x.size()
        x = F.softmax(x.view(sz[0] * sz[1], sz[2]), dim=1)
        x = x.view(sz)
        attn_scores = x

        # encoder_out[1] holds the values: encoder outputs combined with the input embeddings
        x = self.bmm(x, encoder_out[1])

        # scale attention output (respecting potentially different lengths)
        s = encoder_out[1].size(1)
        if encoder_padding_mask is None:
            x = x * (s * math.sqrt(1.0 / s))
        else:
            s = s - encoder_padding_mask.type_as(x).sum(
                dim=1, keepdim=True
            )  # exclude padding
            s = s.unsqueeze(-1)
            x = x * (s * s.rsqrt())

        # project back
        x = (self.out_projection(x) + residual) * math.sqrt(0.5)
        return x, attn_scores

    def make_generation_fast_(self, beamable_mm_beam_size=None, **kwargs):
        """Replace torch.bmm with BeamableMM."""
        if beamable_mm_beam_size is not None:
            del self.bmm
            self.add_module("bmm", BeamableMM(beamable_mm_beam_size))
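
To make the roles of Q, K and V concrete, here is a minimal plain-Python sketch of the same computation for a single decoder position. It is an illustration under simplifying assumptions, not fairseq code: the function name `convs2s_attention` and the toy vectors are made up, there is no batching, masking or projection, and the values are taken to be the encoder output plus the input embedding.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def convs2s_attention(d, z, e):
    """Attention for one decoder query d over encoder outputs z and embeddings e.

    Keys are z_j; values are z_j + e_j, so the input embeddings (and with them
    the positional encoding) flow into the context vector.
    """
    weights = softmax([dot(d, z_j) for z_j in z])
    values = [[zi + ei for zi, ei in zip(z_j, e_j)] for z_j, e_j in zip(z, e)]
    dim = len(values[0])
    context = [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]
    # Same factor as `s * math.sqrt(1.0 / s)` in the layer above, i.e. sqrt(s),
    # where s is the number of source positions.
    scale = math.sqrt(len(z))
    return weights, [scale * c for c in context]


d = [1.0, 0.0]                # toy query
z = [[0.0, 0.0], [0.0, 0.0]]  # identical keys, so the weights are uniform
e = [[1.0, 0.0], [0.0, 1.0]]  # embeddings differ per position

weights, context = convs2s_attention(d, z, e)
print(weights)  # [0.5, 0.5]
print(context)
```

Even though the keys here are identical, the context vector still depends on the embeddings, which is exactly how e participates in the attention.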

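A side note

All of this shows that we should not let preconceptions confine us.

This post was inspired by the geological forecasting challenge held by NORCE, in which the author was fortunate enough to win first place.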

References

1. https://sh-tsang.medium.com/review-convolutional-sequence-to-sequence-learning-convs2s-510a9eddce05

#DeepLearning #ComputerVision #NLP

https://lijianxiong.space/2025/20250601/

Author: LJX

Published: June 1, 2025
