AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention

(CVPR 2025)

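AGLA introduces an image-prompt matching scheme that captures prompt-relevant local features from the image. The result is an augmented view of the input image in which prompt-relevant content is highlighted and irrelevant distractions are suppressed.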

Motivation

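Several studies have explored potential causes of object hallucination, such as statistical pre-training bias and over-reliance on parametric knowledge.

The authors instead argue that hallucination stems from an attention deficiency toward the image: LVLMs attend mainly to global image features and fail to capture the local features relevant to the prompt.

To support this, the authors examine how much weight the LVLM's self-attention places on image features when it answers text queries about different objects.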

Method

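The first component is the Image-Prompt Matching (IPM) module.

A matching model computes an overall similarity score $sim(v,t)$ between the image $v$ and the text prompt $t$.

GradCAM is then applied to the cross-attention layers of the matching model, giving each image patch a correlation score with respect to the input prompt: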
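$$
\operatorname{cor}(j)=\frac{1}{H}\sum_{i}\sum_{h=1}^{H}\max\left(0,\frac{\partial\,\operatorname{sim}(v,t)}{\partial C_{i,j}^{(h)}}\right)C_{i,j}^{(h)}
$$
Here $C_{i,j}^{(h)}$ is the cross-attention weight between text token $i$ and image patch $j$ in attention head $h$, and $H$ is the number of heads.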
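The score above can be sketched in NumPy as follows, assuming the cross-attention maps and their gradients have already been extracted from the matching model as arrays of shape (heads, text tokens, image patches). The function name `patch_relevance` and this array layout are assumptions made for illustration, not the authors' code.

```python
import numpy as np


def patch_relevance(attn: np.ndarray, grad: np.ndarray) -> np.ndarray:
    """Gradient-weighted cross-attention score per image patch.

    attn, grad: arrays of shape (H, M, P) = (heads, text tokens, patches).
    grad holds d sim(v, t) / d attn. This layout is an assumption.
    """
    if attn.shape != grad.shape:
        raise ValueError("attn and grad must have the same shape")
    num_heads = attn.shape[0]
    # Keep only positive gradients and use them to weight the attention.
    weighted = np.maximum(grad, 0.0) * attn
    # Sum over heads and text tokens, then divide by the number of heads.
    return weighted.sum(axis=(0, 1)) / num_heads


# Toy input: 2 heads, 2 text tokens, 2 image patches.
attn = np.array([[[0.2, 0.8], [0.5, 0.5]],
                 [[0.6, 0.4], [0.1, 0.9]]])
grad = np.array([[[1.0, -1.0], [2.0, 0.0]],
                 [[0.0, 1.0], [-3.0, 1.0]]])
cor = patch_relevance(attn, grad)
```

Clipping the gradient at zero means attention entries that would lower the similarity score contribute nothing, so only evidence in favour of the match is accumulated.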

The authors call this GradCAM, but elsewhere it is more often described as a Taylor-expansion saliency map (gradient times activation).

The masking ratio is set to $sim(v,t)/2$.

The code also uses a weighted average and gamma correction:

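import numpy as np
from matplotlib import pyplot as plt
from scipy.ndimage import gaussian_filter
from skimage import transform as skimage_transform


def getAttMap(img, attMap, blur=True, overlap=True):
    # Work on a float copy so the caller's array is not modified in place.
    attMap = np.asarray(attMap, dtype=np.float64).copy()
    # Min-max normalize to [0, 1].
    attMap -= attMap.min()
    if attMap.max() > 0:
        attMap /= attMap.max()
    # Upsample the map to the image resolution (bicubic interpolation).
    attMap = skimage_transform.resize(attMap, (img.shape[:2]), order=3, mode="constant")
    if blur:
        # Smooth with a Gaussian whose sigma scales with image size, then renormalize.
        attMap = gaussian_filter(attMap, 0.02 * max(img.shape[:2]))
        attMap -= attMap.min()
        if attMap.max() > 0:
            attMap /= attMap.max()
    # Map attention values to RGB with the "jet" colormap and drop the alpha channel.
    cmap = plt.get_cmap("jet")
    attMapV = cmap(attMap)
    attMapV = np.delete(attMapV, 3, 2)
    if overlap:
        # Weighted average of image and heatmap; the exponent 0.7 is the gamma correction.
        attMap = (
            1 * (1 - attMap**0.7).reshape(attMap.shape + (1,)) * img
            + (attMap**0.7).reshape(attMap.shape + (1,)) * attMapV
        )
    return attMap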
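In the listing above, the exponent 0.7 is the gamma correction and the `(1 - w) * img + w * attMapV` expression is the weighted average. A minimal standalone sketch of that blend follows; the helper name `gamma_blend` is hypothetical, and a fixed red overlay stands in for the `jet` colormap to keep the example small.

```python
import numpy as np


def gamma_blend(img, heat, att, gamma=0.7):
    """Blend img and heat per pixel with weight att**gamma.

    img, heat: float arrays of shape (H, W, 3) with values in [0, 1].
    att: float array of shape (H, W) with values in [0, 1].
    """
    # A gamma below 1 lifts mid-range attention values toward 1.
    w = (att ** gamma)[..., None]
    return (1.0 - w) * img + w * heat


img = np.zeros((1, 3, 3))        # black image, 1 x 3 pixels
heat = np.zeros((1, 3, 3))
heat[..., 0] = 1.0               # pure red overlay (illustrative stand-in)
att = np.array([[0.0, 0.25, 1.0]])
out = gamma_blend(img, heat, att)
```

With a gamma below 1, a pixel with moderate attention gets a larger share of the heatmap colour than a linear blend would give it, which makes weakly attended regions easier to see.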

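Based on these scores, the IPM module generates an augmented view. It masks out the background and distracting objects that are irrelevant to the prompt and keeps (highlights) only the image regions most relevant to it.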
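A minimal sketch of the masking step, under three assumptions: the per-patch scores have already been upsampled to pixel resolution, the masked fraction is $sim(v,t)/2$, and the least relevant pixels are the ones removed. The function name `augment_view` and the use of zero as the fill value are illustrative choices, not taken from the authors' code.

```python
import numpy as np


def augment_view(image, relevance, sim):
    """Zero out the least prompt-relevant pixels of image.

    image: (H, W, 3) array; relevance: (H, W) scores; sim: similarity in [0, 1].
    The masked fraction is sim / 2 (assumed reading of the masking ratio).
    """
    ratio = float(np.clip(sim, 0.0, 1.0)) / 2.0
    num_masked = int(relevance.size * ratio)
    keep = np.ones(relevance.size, dtype=bool)
    if num_masked > 0:
        # Flat indices of the num_masked lowest-relevance pixels.
        lowest = np.argsort(relevance, axis=None, kind="stable")[:num_masked]
        keep[lowest] = False
    keep = keep.reshape(relevance.shape)
    # Broadcast the 2-D mask over the colour channels.
    return image * keep[..., None], keep


image = np.ones((2, 2, 3))
relevance = np.array([[0.9, 0.1], [0.4, 0.7]])
view, keep = augment_view(image, relevance, sim=1.0)
```

In the toy call, `sim=1.0` gives a masking ratio of 0.5, so the two lowest-scoring pixels are blanked and the two highest-scoring ones are kept.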

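Decoding follows the contrastive-decoding setup, except that the logits from the two views are combined by addition: $\text{logit}_{\text{orig}}+\alpha\,\text{logit}_{\text{aug}}$.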
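As a rough sketch of that combination for a single decoding step, assume the two next-token logit vectors come from running the same LVLM on the original image and on the augmented view. The function name `assemble_logits` and the default `alpha=1.0` are placeholders, not values from the paper.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def assemble_logits(logits_orig, logits_aug, alpha=1.0):
    """Combine logits from the original and the augmented view."""
    return logits_orig + alpha * logits_aug


# Toy vocabulary of three tokens.
logits_orig = np.array([2.0, 1.9, 0.1])   # original image slightly prefers token 0
logits_aug = np.array([0.5, 2.5, 0.0])    # augmented view clearly prefers token 1
probs = softmax(assemble_logits(logits_orig, logits_aug, alpha=1.0))
```

In this toy case the original image alone would pick token 0 by a narrow margin, while the combined distribution picks token 1, which is the kind of correction the local view is meant to provide.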

It also applies the adaptive plausibility constraint that is common in contrastive decoding.
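In its usual contrastive-decoding form, the constraint keeps only tokens whose probability under the original-image distribution is at least $\beta$ times the top probability, and excludes the rest before the next token is chosen. A minimal sketch, where `beta=0.1` and the function name are illustrative assumptions:

```python
import numpy as np


def apply_plausibility_constraint(logits_orig, logits_combined, beta=0.1):
    """Mask tokens that are implausible under the original distribution.

    A token is kept if p_orig(token) >= beta * max(p_orig); all other
    tokens have their combined logit set to -inf.
    """
    probs = np.exp(logits_orig - logits_orig.max())
    probs /= probs.sum()
    plausible = probs >= beta * probs.max()
    return np.where(plausible, logits_combined, -np.inf)


logits_orig = np.array([3.0, 2.0, -4.0])
logits_combined = np.array([3.5, 4.0, 9.0])
filtered = apply_plausibility_constraint(logits_orig, logits_combined)
```

Token 2 has the largest combined logit here but is very unlikely under the original distribution, so it is filtered out and token 1 becomes the top choice.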

#DeepLearning #LargeModels

https://lijianxiong.space/2025/20251105/

Author: LJX

Published: November 5, 2025