随机块模型

在笔者之前研究图神经网络的时候，一直苦于难以使用数学去之间建模图数据。但似乎这一切有了转机，笔者最近发现了一种特殊的分析方法来分析GNN，对此这就不得不读了。

像之前提到的《Joint Graph Rewiring and Feature Denoising via Spectral Resonance》（ICLR2025）、上篇的《Is homophily a necessity for graph neural networks? 》ICLR2022、《Understanding Non-linearity in Graph Neural Networks from the Bayesian-Inference Perspective》(NIPS2022)、《When Do Graph Neural Networks Help with Node Classification? Investigating the Impact of Homophily Principle on Node Distinguishability》（NIPS2023）都使用到了stochastic block model或contextual stochastic block model（SBM/CSBM)。

SBM

SBM 的基本参数包括：节点数 n、将节点划分为r个社区的分区集合 {C1,…,Cr}，以及一个对称的 r×r边概率矩阵 P。
对于任意两个节点 $u∈C_i，v\in C_j$，在 SBM 中它们以概率 $P_{ij}$ 相连。

def generate_sbm(n, sizes, p_matrix, seed=None):
    """
    Generate a graph from the Stochastic Block Model (SBM).

    Parameters:
    ----------
    n : int
        Total number of nodes.
    sizes : list of int
        Sizes of each community (sum(sizes) == n).
    p_matrix : 2D array-like of shape (r, r)
        Connection probability matrix between communities.
    seed : int, optional
        Random seed for reproducibility.

    Returns:
    -------
    G : networkx.Graph
        Generated SBM graph with node attribute 'block' indicating community.
    labels : np.ndarray of shape (n,)
        Community label for each node.
    """
    if seed is not None:
        np.random.seed(seed)
    r = len(sizes)
    assert p_matrix.shape == (r, r), "p_matrix must be r x r"

    # Assign block labels
    labels = np.hstack([[i] * size for i, size in enumerate(sizes)])

    # Initialize adjacency matrix
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            pi = labels[i]
            pj = labels[j]
            if np.random.rand() < p_matrix[pi, pj]:
                A[i, j] = A[j, i] = 1

    # Build graph
    G = nx.from_numpy_array(A)
    # Assign block attribute
    for idx, lbl in enumerate(labels):
        G.nodes[idx]['block'] = int(lbl)

    return G, labels

CSBM

cSBM 在 SBM 基础上额外定义：

对每个节点 i，从与其社区i相关的属性分布（如高维高斯分布）中抽样特征向量$x_u$。
边的生成仍遵循 SBM 的 $P_{ij}$；节点属性生成由社区-条件分布控制。

def generate_contextual_sbm(n, pi, p_matrix, theta, feature_dist='gaussian', seed=None):
    """
    Generate a graph with attributes from the Contextual SBM (cSBM).

    Parameters:
    ----------
    n : int
        Total number of nodes.
    pi : list or np.ndarray of shape (r,)
        Mixing proportions for each of r communities.
    p_matrix : 2D array-like of shape (r, r)
        Connection probability matrix between communities.
    theta : list of dict
        Parameters for each community's feature distribution.
        For Gaussian: {'mean': array, 'cov': array}
    feature_dist : str
        Distribution type ('gaussian').
    seed : int, optional
        Random seed for reproducibility.

    Returns:
    -------
    G : networkx.Graph
        Generated cSBM graph with node attributes 'block' and 'features'.
    labels : np.ndarray of shape (n,)
        Community label for each node.
    X : np.ndarray of shape (n, d)
        Feature matrix.
    """
    if seed is not None:
        np.random.seed(seed)
    r = len(pi)
    assert p_matrix.shape == (r, r), "p_matrix must be r x r"
    assert len(theta) == r, "theta must have one entry per community"

    # Sample community labels
    labels = np.random.choice(r, size=n, p=pi)
    # Generate features
    # Assume Gaussian mixture
    d = theta[0]['mean'].shape[0]
    X = np.zeros((n, d))
    for i in range(n):
        z = labels[i]
        params = theta[z]
        X[i] = np.random.multivariate_normal(mean=params['mean'], cov=params['cov'])

    # Generate adjacency
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if np.random.rand() < p_matrix[labels[i], labels[j]]:
                A[i, j] = A[j, i] = 1

    # Build graph
    G = nx.from_numpy_array(A)
    for idx in range(n):
        G.nodes[idx]['block'] = int(labels[idx])
        G.nodes[idx]['features'] = X[idx]

    return G, labels, X

分析

形式确实很简单，有点类似回归中解释是高斯分布。

我们可以来看上篇的《Is homophily a necessity for graph neural networks? 》是怎么使用的。

我们考虑一个图 G，其中每个节点 i 具有特征 x∈ R 标签 y。

我们假设

(1) 节点 i 的特征从特征分布 F 中采样，即 x∼ F，μ(F)表示其均值；

(2) x 的维度之间相互独立；

(3) X 中的特征被一个正标量 B 所限制，即 max|X[i, j]| ≤ B；

(4) 对于节点 i，其邻居的标签从邻居分布 D 中独立采样。采样重复 deg(i)次以采样 deg(i)个邻居的标签。

定理1： 单个GCN操作激活前的输出的期望是：
$$
E[h_i]=W(E_{c\sim D_{y_i},x\sim F_c}[x])
$$
以及对于任何 t > 0，观测值与其期望的距离大于 t 的概率是有界的：
$$
P(||h_i-E[h_i]||_2\ge t)\le 2l\cdot exp(-\frac{deg(i)t^2}{2\rho^2(W)B^2l})
$$

其中$\rho$代表最大奇异值，$l$代表特征维度。

对于CSBM模型$G\sim CSBM(\mu_1,\mu_2,p,q)$，我们可以认为邻域分布$D_{c1}=[\frac{p}{p+q},\frac{q}{p+q}],D_{c2}=[\frac{q}{p+q},\frac{p}{p+q}]$。

基于领域分布，对于$i\in C_1$：
$$
h_i\sim N(\frac{p\mu_1+q\mu_2}{p+q},\frac{I}{\sqrt{deg(i)}})
$$
根据高斯分布的性质，很容易看出定理1成立。

命题1： $(E_{c1}[x_i],E_{c2}[x_i])$和$(E_{c1}[h_i],E_{c2}[h_i])$具有相同的中点m，且$E_{c1}[x_i]-E_{c2}[x_i]$和$E_{c1}[h_i]-E_{c2}[h_i]$具有相同的方向w。$m=\frac{\mu_1+\mu_2}{2},w=\frac{\mu_1-\mu_2}{||\mu_1-\mu_2||_2}$。

定理2： 图符合CSBM，则对于任意节点i，当deg(i)>(p+q)/(p-q)时，由决策边界 P 定义的线性分类器误分类$h_i$的概率低于误分类$x_i$的概率。

证明：

我们仅对来自类$C_1$的节点进行证明。因为对称性，其他的证明相同。

对于节点$i\in C_1$，
$$
\begin{align}
P(x_i \ \text{is mis-classfied})=P(w^Tx_i+b\le0)
\\
P(h_i \ \text{is mis-classfied})=P(w^Th_i+b\le0)
\end{align}
$$

w和$b=-w^T(\mu_1+\mu_2)/2$是决策边界P的参数。

我们有：
$$
P(w^Th_i+b\le 0)=P(w^T\sqrt{deg(i)}h_i+\sqrt{deg(i)}b\le 0)
$$
令$h_i’=\sqrt{deg(i)}h_i$，故：
$$
h_i’=\sqrt{deg(i)}h_i\sim N(\frac{\sqrt{deg(i)}(p\mu_1+q\mu_2)}{p+q},I)
$$
$x_i$和$h_i’$具有相同的方差。

为了比较误分类概率，我们只需比较期望值到决策边界的距离，具体而言：
$$
\begin{align}
dis_{x_i}&=\frac{||\mu_1-\mu_2|| _ 2}{2}
\\
dis_{h_i’}&=\frac{\sqrt{deg(i)}|p-q|}{p+q}\frac{||\mu_1-\mu_2||_2}{2}
\end{align}
$$
得证。

#深度学习 #图神经网络

随机块模型

https://lijianxiong.space/2025/20250513/

作者

LJX

发布于

2025年5月13日

许可协议

（几乎涵盖一切的）表示学习统一框架上一篇

同质性对图神经网络是必需的吗？下一篇