SparseViT: An Image Manipulation Localization Network without Handcrafted Priors

(AAAI 2025)

Introduction

Image Manipulation Localization (IML) aims to identify the specific tampered regions within an image.

Manipulation inevitably leaves traces on an image, and these traces can be divided into semantic and non-semantic (semantic-agnostic) features. Semantic-agnostic features are those that highlight low-level trace information and are independent of the image's semantic content.

Almost all existing IML models follow the same design: a semantic-segmentation backbone combined with handcrafted non-semantic feature extraction.

Most of them rely on handcrafted features with built-in priors, such as DCT or Sobel filters.

SparseViT discards these priors entirely, making the model more end-to-end.

Algorithm

Sparse Self-Attention

```python
def alter_sparse(x, sparse_size=8):
    # (B, C, H, W) -> (B, H, W, C)
    x = x.permute(0, 2, 3, 1)
    assert x.shape[1] % sparse_size == 0 and x.shape[2] % sparse_size == 0, \
        'image size should be divisible by sparse_size'
    grid_size = x.shape[1] // sparse_size
    # `block` (a helper from the repo) partitions the map into
    # sparse_size x sparse_size blocks of size grid_size x grid_size
    out, H, Hp, C = block(x, grid_size)
    # regroup pixels that share the same offset inside each block: every new
    # sparse_size x sparse_size window gathers pixels spaced grid_size apart
    out = out.permute(0, 3, 4, 1, 2, 5).contiguous()
    out = out.reshape(-1, sparse_size, sparse_size, C)
    out = out.permute(0, 3, 1, 2)  # back to channels-first
    return out, H, Hp, C


def alter_unsparse(x, H, Hp, C, sparse_size=8):
    # inverse of alter_sparse: scatter the sparse windows back to (B, C, H, W)
    x = x.permute(0, 2, 3, 1)
    x = x.reshape(-1, Hp // sparse_size, Hp // sparse_size, sparse_size, sparse_size, C)
    x = x.permute(0, 3, 4, 1, 2, 5).contiguous()
    out = unblock(x, H)  # `unblock` (from the repo) undoes the block partition
    out = out.permute(0, 3, 1, 2)
    return out
```

This behaves somewhat like dilated (atrous) convolution: each sparse window attends over pixels that are spread across the whole feature map with a fixed stride. A self-contained sketch of this regrouping is shown below.
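The following is a minimal, self-contained sketch of this alternating sparse (dilated) partition. The helper names `sparse_partition` / `sparse_reverse` and the exact tensor layout are my own assumptions for illustration; they are not the repo's `block` / `unblock` helpers.

```python
import torch

# Minimal sketch of the alternating sparse (dilated) partition; names and
# layout are assumptions, not the repo's `block`/`unblock` implementation.
def sparse_partition(x, sparse_size=8):
    B, C, H, W = x.shape
    g = H // sparse_size                      # spacing (dilation) between grouped pixels
    # split H and W into (sparse_size, g): index h = s_h * g + o_h
    x = x.reshape(B, C, sparse_size, g, sparse_size, g)
    # group pixels sharing the same offset (o_h, o_w); each window then holds
    # sparse_size x sparse_size pixels that sit g apart in the original map
    x = x.permute(0, 3, 5, 1, 2, 4)           # (B, g, g, C, sparse_size, sparse_size)
    return x.reshape(B * g * g, C, sparse_size, sparse_size)

def sparse_reverse(windows, B, C, H, W, sparse_size=8):
    g = H // sparse_size
    x = windows.reshape(B, g, g, C, sparse_size, sparse_size)
    x = x.permute(0, 3, 4, 1, 5, 2)           # (B, C, sparse_size, g, sparse_size, g)
    return x.reshape(B, C, H, W)

x = torch.randn(2, 16, 64, 64)
w = sparse_partition(x)                       # (2 * 8 * 8, 16, 8, 8)
assert torch.equal(sparse_reverse(w, 2, 16, 64, 64), x)
```

The round trip at the end checks that the partition is exactly invertible, mirroring the `alter_sparse` / `alter_unsparse` pair above.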

Lightweight and Effective Prediction Head: Learnable Feature Fusion (LFF)

It borrows the LayerScale technique from "Going deeper with Image Transformers".
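For reference, LayerScale is simply a learnable per-channel scale applied to a branch output, initialized to a small value. A minimal sketch (not the CaiT authors' code) might look like:

```python
import torch
import torch.nn as nn

# Minimal LayerScale sketch (per "Going deeper with Image Transformers"):
# a learnable per-channel scale, initialized to a small value such as 1e-6,
# applied to a branch output before it is merged back into the main path.
class LayerScale(nn.Module):
    def __init__(self, dim, init_value=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):
        # x: (..., dim); scales each channel independently
        return self.gamma * x
```

In the LFF below, one such gamma is attached to each of the six fused feature maps.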

The final LFF can be written as
$$
\begin{align}
F_i &= \mathrm{Linear}(C_i, C)(F_i), \quad i = 1, \dots, 4; \qquad F_i = \mathrm{Upsample}()(F_i), \quad i = 5, 6 \\
M_p &= \mathrm{ADD}(F_i \times \gamma_i) \\
M_p &= \mathrm{Linear}(C, 1)(M_p) \\
M_p &= \mathrm{Upsample}()(M_p)
\end{align}
$$

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from functools import partial

class Multiple(nn.Module):
    """Learnable Feature Fusion (LFF) head.

    Note: embed_dim must match the 512 output channels of the 1x1 convs below
    for the LayerScale-style gammas and the LayerNorm to broadcast correctly.
    """
    def __init__(self,
                 init_value=1e-6,
                 embed_dim=256,
                 predict_channels=1,
                 norm_layer=partial(nn.LayerNorm, eps=1e-6)):
        super(Multiple, self).__init__()
        # one LayerScale-style learnable scale per fused feature, initialized small
        self.gamma1 = nn.Parameter(init_value * torch.ones(embed_dim), requires_grad=True)
        self.gamma2 = nn.Parameter(init_value * torch.ones(embed_dim), requires_grad=True)
        self.gamma3 = nn.Parameter(init_value * torch.ones(embed_dim), requires_grad=True)
        self.gamma4 = nn.Parameter(init_value * torch.ones(embed_dim), requires_grad=True)
        self.gamma5 = nn.Parameter(init_value * torch.ones(embed_dim), requires_grad=True)
        self.gamma6 = nn.Parameter(init_value * torch.ones(embed_dim), requires_grad=True)
        # self.drop_path = nn.Identity()
        self.norm = norm_layer(embed_dim)

        # 1x1 convs that project the first four feature maps to the embedding width
        self.conv_layer1 = nn.Conv2d(in_channels=320, out_channels=512, kernel_size=1, stride=1, padding=0)
        self.conv_layer2 = nn.Conv2d(in_channels=320, out_channels=512, kernel_size=1, stride=1, padding=0)
        self.conv_layer3 = nn.Conv2d(in_channels=320, out_channels=512, kernel_size=1, stride=1, padding=0)
        self.conv_layer4 = nn.Conv2d(in_channels=320, out_channels=512, kernel_size=1, stride=1, padding=0)
        # final 1x1 conv that produces the predicted mask logits
        self.conv_last = nn.Conv2d(embed_dim, predict_channels, kernel_size=1)

    def forward(self, x):
        c1, c2, c3, c4, c5, c6 = x

        # project c1..c4 to the embedding width
        c1 = self.conv_layer1(c1)
        c2 = self.conv_layer2(c2)
        c3 = self.conv_layer3(c3)
        c4 = self.conv_layer4(c4)
        b, c, h, w = c1.shape
        # upsample the two low-resolution features to the common spatial size
        c5 = F.interpolate(c5, size=(h, w), mode='bilinear', align_corners=False)
        c6 = F.interpolate(c6, size=(h, w), mode='bilinear', align_corners=False)
        # flatten each map to a (B, H*W, C) token sequence
        c1 = c1.flatten(2).transpose(1, 2)
        c2 = c2.flatten(2).transpose(1, 2)
        c3 = c3.flatten(2).transpose(1, 2)
        c4 = c4.flatten(2).transpose(1, 2)
        c5 = c5.flatten(2).transpose(1, 2)
        c6 = c6.flatten(2).transpose(1, 2)
        # weighted sum of the six features with per-channel learnable scales
        x = self.gamma1 * c1 + self.gamma2 * c2 + self.gamma3 * c3 \
            + self.gamma4 * c4 + self.gamma5 * c5 + self.gamma6 * c6
        x = x.transpose(1, 2).reshape(b, c, h, w)
        # channels-last LayerNorm, then back to channels-first
        x = (self.norm(x.permute(0, 2, 3, 1))).permute(0, 3, 1, 2).contiguous()
        x = self.conv_last(x)
        return x
```
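As a hypothetical usage sketch: the channel counts below follow the hard-coded 1x1 convolutions above, and `embed_dim=512` is passed so that the gamma scales and LayerNorm match the conv outputs (the default of 256 would not broadcast).

```python
import torch

# Hypothetical usage sketch: six multi-scale backbone features, the first four
# with 320 channels, the last two already at the 512-channel embedding width
# but at a lower resolution.
lff = Multiple(embed_dim=512, predict_channels=1)
feats = [torch.randn(2, 320, 64, 64) for _ in range(4)]
feats += [torch.randn(2, 512, 32, 32) for _ in range(2)]
mask_logits = lff(feats)       # (2, 1, 64, 64); the final Upsample to full
print(mask_logits.shape)       # resolution happens outside this module
```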

References

https://github.com/scu-zjz/SparseViT

