SparseViT: An Image Manipulation Localization Network without Handcrafted Priors

(AAAI 2025)

Introduction

Image Manipulation Localization (IML) aims to identify the specific tampered regions within an image.

Manipulation inevitably leaves traces on an image, and these traces can be divided into semantic and non-semantic (semantic-agnostic) features. Semantic-agnostic features are those that highlight low-level trace information and are independent of the image's semantic content.

Almost all existing IML models follow the same design: a semantic-segmentation backbone combined with handcrafted non-semantic feature extraction.

Most of them rely on handcrafted features with built-in priors, such as DCT or Sobel filters.

SparseViT discards these priors entirely, making the model more end-to-end.

Algorithm

Sparse Self-Attention

```python
def alter_sparse(x, sparse_size=8):
    # (B, C, H, W) -> (B, H, W, C)
    x = x.permute(0, 2, 3, 1)
    assert x.shape[1] % sparse_size == 0 and x.shape[2] % sparse_size == 0, \
        'image size should be divisible by sparse_size'
    grid_size = x.shape[1] // sparse_size
    # `block` (a helper from the repo) partitions the map into
    # sparse_size x sparse_size blocks of size grid_size x grid_size
    out, H, Hp, C = block(x, grid_size)
    # regroup pixels that share the same offset inside each block: every new
    # sparse_size x sparse_size window gathers pixels spaced grid_size apart
    out = out.permute(0, 3, 4, 1, 2, 5).contiguous()
    out = out.reshape(-1, sparse_size, sparse_size, C)
    out = out.permute(0, 3, 1, 2)  # back to channels-first
    return out, H, Hp, C


def alter_unsparse(x, H, Hp, C, sparse_size=8):
    # inverse of alter_sparse: scatter the sparse windows back to (B, C, H, W)
    x = x.permute(0, 2, 3, 1)
    x = x.reshape(-1, Hp // sparse_size, Hp // sparse_size, sparse_size, sparse_size, C)
    x = x.permute(0, 3, 4, 1, 2, 5).contiguous()
    out = unblock(x, H)  # `unblock` (from the repo) undoes the block partition
    out = out.permute(0, 3, 1, 2)
    return out
```

This behaves somewhat like dilated (atrous) convolution: each sparse window attends over pixels that are spread across the whole feature map with a fixed stride. A self-contained sketch of this regrouping is shown below.
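The following is a minimal, self-contained sketch of this alternating sparse (dilated) partition. The helper names `sparse_partition` / `sparse_reverse` and the exact tensor layout are my own assumptions for illustration; they are not the repo's `block` / `unblock` helpers.

```python
import torch

# Minimal sketch of the alternating sparse (dilated) partition; names and
# layout are assumptions, not the repo's `block`/`unblock` implementation.
def sparse_partition(x, sparse_size=8):
    B, C, H, W = x.shape
    g = H // sparse_size                      # spacing (dilation) between grouped pixels
    # split H and W into (sparse_size, g): index h = s_h * g + o_h
    x = x.reshape(B, C, sparse_size, g, sparse_size, g)
    # group pixels sharing the same offset (o_h, o_w); each window then holds
    # sparse_size x sparse_size pixels that sit g apart in the original map
    x = x.permute(0, 3, 5, 1, 2, 4)           # (B, g, g, C, sparse_size, sparse_size)
    return x.reshape(B * g * g, C, sparse_size, sparse_size)

def sparse_reverse(windows, B, C, H, W, sparse_size=8):
    g = H // sparse_size
    x = windows.reshape(B, g, g, C, sparse_size, sparse_size)
    x = x.permute(0, 3, 4, 1, 5, 2)           # (B, C, sparse_size, g, sparse_size, g)
    return x.reshape(B, C, H, W)

x = torch.randn(2, 16, 64, 64)
w = sparse_partition(x)                       # (2 * 8 * 8, 16, 8, 8)
assert torch.equal(sparse_reverse(w, 2, 16, 64, 64), x)
```

The round trip at the end checks that the partition is exactly invertible, mirroring the `alter_sparse` / `alter_unsparse` pair above.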

Lightweight and Effective Prediction Head: Learnable Feature Fusion (LFF)

It borrows the LayerScale technique from "Going deeper with Image Transformers".
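For reference, LayerScale is simply a learnable per-channel scale applied to a branch output, initialized to a small value. A minimal sketch (not the CaiT authors' code) might look like:

```python
import torch
import torch.nn as nn

# Minimal LayerScale sketch (per "Going deeper with Image Transformers"):
# a learnable per-channel scale, initialized to a small value such as 1e-6,
# applied to a branch output before it is merged back into the main path.
class LayerScale(nn.Module):
    def __init__(self, dim, init_value=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):
        # x: (..., dim); scales each channel independently
        return self.gamma * x
```

In the LFF below, one such gamma is attached to each of the six fused feature maps.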

The final LFF can be written as
$$
\begin{align}
F_i &= \mathrm{Linear}(C_i, C)(F_i), \quad i = 1, \dots, 4; \qquad F_i = \mathrm{Upsample}()(F_i), \quad i = 5, 6 \\
M_p &= \mathrm{ADD}(F_i \times \gamma_i) \\
M_p &= \mathrm{Linear}(C, 1)(M_p) \\
M_p &= \mathrm{Upsample}()(M_p)
\end{align}
$$

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from functools import partial

class Multiple(nn.Module):
    """Learnable Feature Fusion (LFF) head.

    Note: embed_dim must match the 512 output channels of the 1x1 convs below
    for the LayerScale-style gammas and the LayerNorm to broadcast correctly.
    """
    def __init__(self,
                 init_value=1e-6,
                 embed_dim=256,
                 predict_channels=1,
                 norm_layer=partial(nn.LayerNorm, eps=1e-6)):
        super(Multiple, self).__init__()
        # one LayerScale-style learnable scale per fused feature, initialized small
        self.gamma1 = nn.Parameter(init_value * torch.ones(embed_dim), requires_grad=True)
        self.gamma2 = nn.Parameter(init_value * torch.ones(embed_dim), requires_grad=True)
        self.gamma3 = nn.Parameter(init_value * torch.ones(embed_dim), requires_grad=True)
        self.gamma4 = nn.Parameter(init_value * torch.ones(embed_dim), requires_grad=True)
        self.gamma5 = nn.Parameter(init_value * torch.ones(embed_dim), requires_grad=True)
        self.gamma6 = nn.Parameter(init_value * torch.ones(embed_dim), requires_grad=True)
        # self.drop_path = nn.Identity()
        self.norm = norm_layer(embed_dim)

        # 1x1 convs that project the first four feature maps to the embedding width
        self.conv_layer1 = nn.Conv2d(in_channels=320, out_channels=512, kernel_size=1, stride=1, padding=0)
        self.conv_layer2 = nn.Conv2d(in_channels=320, out_channels=512, kernel_size=1, stride=1, padding=0)
        self.conv_layer3 = nn.Conv2d(in_channels=320, out_channels=512, kernel_size=1, stride=1, padding=0)
        self.conv_layer4 = nn.Conv2d(in_channels=320, out_channels=512, kernel_size=1, stride=1, padding=0)
        # final 1x1 conv that produces the predicted mask logits
        self.conv_last = nn.Conv2d(embed_dim, predict_channels, kernel_size=1)

    def forward(self, x):
        c1, c2, c3, c4, c5, c6 = x

        # project c1..c4 to the embedding width
        c1 = self.conv_layer1(c1)
        c2 = self.conv_layer2(c2)
        c3 = self.conv_layer3(c3)
        c4 = self.conv_layer4(c4)
        b, c, h, w = c1.shape
        # upsample the two low-resolution features to the common spatial size
        c5 = F.interpolate(c5, size=(h, w), mode='bilinear', align_corners=False)
        c6 = F.interpolate(c6, size=(h, w), mode='bilinear', align_corners=False)
        # flatten each map to a (B, H*W, C) token sequence
        c1 = c1.flatten(2).transpose(1, 2)
        c2 = c2.flatten(2).transpose(1, 2)
        c3 = c3.flatten(2).transpose(1, 2)
        c4 = c4.flatten(2).transpose(1, 2)
        c5 = c5.flatten(2).transpose(1, 2)
        c6 = c6.flatten(2).transpose(1, 2)
        # weighted sum of the six features with per-channel learnable scales
        x = self.gamma1 * c1 + self.gamma2 * c2 + self.gamma3 * c3 \
            + self.gamma4 * c4 + self.gamma5 * c5 + self.gamma6 * c6
        x = x.transpose(1, 2).reshape(b, c, h, w)
        # channels-last LayerNorm, then back to channels-first
        x = (self.norm(x.permute(0, 2, 3, 1))).permute(0, 3, 1, 2).contiguous()
        x = self.conv_last(x)
        return x
```
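As a hypothetical usage sketch: the channel counts below follow the hard-coded 1x1 convolutions above, and `embed_dim=512` is passed so that the gamma scales and LayerNorm match the conv outputs (the default of 256 would not broadcast).

```python
import torch

# Hypothetical usage sketch: six multi-scale backbone features, the first four
# with 320 channels, the last two already at the 512-channel embedding width
# but at a lower resolution.
lff = Multiple(embed_dim=512, predict_channels=1)
feats = [torch.randn(2, 320, 64, 64) for _ in range(4)]
feats += [torch.randn(2, 512, 32, 32) for _ in range(2)]
mask_logits = lff(feats)       # (2, 1, 64, 64); the final Upsample to full
print(mask_logits.shape)       # resolution happens outside this module
```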

References

https://github.com/scu-zjz/SparseViT

