The GELU Function and Its Approximations

GELU, short for Gaussian Error Linear Unit, was introduced in the paper "Gaussian Error Linear Units (GELUs)" and is widely used in large language models (LLMs).

The GELU function

As the name suggests, it is related to the Gaussian (normal) distribution.
$$
\begin{equation}\text{GELU}(x)=x \Phi(x)\end{equation}
$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution:
$$
\begin{equation}\Phi(x)=\int_{-\infty}^x \frac{e^{-t^2/2}}{\sqrt{2\pi}}dt=\frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]\end{equation}
$$
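
To make the definition concrete, here is a minimal sketch that evaluates GELU directly from the erf form above. The function names `phi` and `gelu` are illustrative choices, not taken from the paper; only the standard library's `math.erf` is used.

```python
import math


def phi(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x)."""
    return x * phi(x)


# Print a few sample values on both sides of zero.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"GELU({x:+.1f}) = {gelu(x):+.6f}")
```

For large positive inputs GELU behaves like the identity, and for large negative inputs it approaches zero, which is what the factor $\Phi(x)$ suggests.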

Because erf is not easy to compute, the following approximations are used.

Approximation 1

$$
\begin{equation}x\Phi(x)\approx x\sigma(1.702 x)\label{eq:x-sigma}\end{equation}
$$

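$x\sigma(\beta x)$ is also known as Swish; when $\beta=1$ it is called SiLU.

Swish was proposed by Google Brain in 2017.

The authors searched over a space of candidate activation functions composed from basic building blocks and concluded that Swish performed best.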
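
As a rough illustration, the sketch below implements $x\sigma(\beta x)$ and compares the $\beta=1.702$ case against exact GELU on a grid. The helper names (`sigmoid`, `swish`, `silu`, `gelu_sigmoid`, `gelu_exact`) and the grid range $[-6, 6]$ are assumptions made for this example only.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish: x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def silu(x: float) -> float:
    """SiLU is Swish with beta = 1."""
    return swish(x, 1.0)


def gelu_sigmoid(x: float) -> float:
    """Sigmoid approximation of GELU: Swish with beta = 1.702."""
    return swish(x, 1.702)


def gelu_exact(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Largest absolute gap between exact GELU and the sigmoid form on [-6, 6].
xs = [i / 100 for i in range(-600, 601)]
max_err = max(abs(gelu_exact(x) - gelu_sigmoid(x)) for x in xs)
print(f"max |GELU(x) - x*sigmoid(1.702x)| on the grid: {max_err:.4f}")
```

The gap is small but visible, which is the usual trade-off of this cheaper form.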

The candidate activation functions were composed from the following building blocks:

Unary functions:
$x, -x, |x|, x^2, x^3, \sqrt{x}, \beta x, x + \beta, \log(|x| + \epsilon), \exp(x), \sin(x), \cos(x)$,
$\sinh(x), \cosh(x), \tanh(x), \sinh^{-1}(x), \tan^{-1}(x), \operatorname{sinc}(x), \max(x,0), \min(x,0), \sigma(x)$,
$\log(1 + \exp(x)), \exp(-x^2), \operatorname{erf}(x), \beta$

Binary functions:
$x_1 + x_2, x_1 \times x_2, x_1 - x_2, \frac{x_1}{x_2 + \epsilon}, \max(x_1, x_2), \min(x_1, x_2), \sigma(x_1) \cdot x_2$,
$\exp(-\beta (x_1 - x_2)^2), \exp(-\beta |x_1 - x_2|), \beta x_1 + (1 - \beta)x_2$

Another paper derives the best $\beta$ from a mathematical point of view: "A logistic approximation to the cumulative normal distribution".

Approximation 2
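
One way to see where a value like 1.702 can come from is a brute-force search. The sketch below assumes that "best" means minimizing the maximum absolute gap between $\Phi(x)$ and $\sigma(\beta x)$. That criterion, the grid ranges, and the names `max_cdf_error` and `best_beta` are assumptions for this illustration, not a reproduction of the paper's method.

```python
import math


def phi(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(z: float) -> float:
    """Logistic function (only called with z >= 0 here)."""
    return 1.0 / (1.0 + math.exp(-z))


def max_cdf_error(beta: float, xs: list) -> float:
    """Largest |Phi(x) - sigmoid(beta * x)| over the sample points."""
    return max(abs(phi(x) - sigmoid(beta * x)) for x in xs)


# The error is symmetric around x = 0, so sampling x >= 0 is enough.
xs = [i / 100 for i in range(0, 501)]
# Candidate beta values from 1.500 to 1.900 in steps of 0.001.
betas = [1.5 + i / 1000 for i in range(0, 401)]

best_beta = min(betas, key=lambda b: max_cdf_error(b, xs))
print(f"best beta on the grid: {best_beta:.3f}")
print(f"max error at best beta: {max_cdf_error(best_beta, xs):.4f}")
```

Under this minimax criterion the search lands close to the constant used in Approximation 1.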

$$
\begin{equation}x\Phi(x)\approx \frac{1}{2} x \left[1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715 x^3\right)\right)\right]\label{eq:x-phi}\end{equation}
$$
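
A minimal sketch of this tanh form next to exact GELU follows. The names `gelu_tanh` and `gelu_exact` and the grid $[-6, 6]$ are illustrative assumptions.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU with the 0.044715 cubic coefficient."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Largest absolute gap between the two on a grid over [-6, 6].
xs = [i / 100 for i in range(-600, 601)]
max_err_tanh = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |GELU(x) - tanh approximation| on the grid: {max_err_tanh:.6f}")
```

On this grid the tanh form tracks exact GELU noticeably more closely than the sigmoid form does.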

This approximation can be traced back to the 1977 paper "Approximations to the Cumulative Normal Function and its Inverse for Use on a Pocket Calculator".

It builds on the approximation proposed by Tocher in 1963:
$$
F(x)=\int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}u^{2}}\, du \simeq \frac{e^{2kx}}{1+e^{2kx}}=\frac{1}{2}\left(1+\tanh(kx)\right),\quad k=\sqrt{2/\pi}
$$
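In that paper, $e^{2kx}/(1+e^{2kx})$ is further refined to $e^{2k_1x(1+k_2x^2)}/(1+e^{2k_1x(1+k_2x^2)})$, which leads to the approximation above.

However, the paper does not actually state what method was used to find the best coefficients.

PS: Su Jianlin (苏神) took this a step further and used a program to search for the best coefficients.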
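
The step from the exponential form to the tanh form can be written out explicitly. Writing $u=k_1x(1+k_2x^2)$ as shorthand (a symbol introduced here only for this derivation):

$$
\begin{equation}\frac{e^{2u}}{1+e^{2u}}=\frac{e^{u}}{e^{u}+e^{-u}}=\frac{1}{2}\left(1+\frac{e^{u}-e^{-u}}{e^{u}+e^{-u}}\right)=\frac{1}{2}\left[1+\tanh(u)\right]\end{equation}
$$

Taking $k_1=\sqrt{2/\pi}$ and $k_2=0.044715$, the values that appear in Approximation 2 of this post, and multiplying by $x$ gives

$$
\begin{equation}x\Phi(x)\approx \frac{1}{2}x\left[1+\tanh\left(\sqrt{\frac{2}{\pi}}\left(x+0.044715x^3\right)\right)\right]\end{equation}
$$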
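
To give a feel for what such a numerical search can look like, here is a rough sketch. It is not a reproduction of the search referred to above. It assumes $k_1$ is fixed at $\sqrt{2/\pi}$ and that $k_2$ is chosen to minimize the maximum absolute error against $\Phi(x)$ on a grid. The names `phi_tanh`, `max_error`, `best_k2` and the grid ranges are hypothetical.

```python
import math

K1 = math.sqrt(2.0 / math.pi)  # assumed fixed; only k2 is searched


def phi(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def phi_tanh(x: float, k2: float) -> float:
    """Tanh approximation of Phi with cubic coefficient k2."""
    return 0.5 * (1.0 + math.tanh(K1 * x * (1.0 + k2 * x * x)))


def max_error(k2: float, xs: list) -> float:
    """Largest |Phi(x) - approximation| over the sample points."""
    return max(abs(phi(x) - phi_tanh(x, k2)) for x in xs)


# The error is symmetric around x = 0, so sampling x >= 0 is enough.
xs = [i / 100 for i in range(0, 501)]
# Candidate k2 values from 0.0300 to 0.0600 in steps of 0.0001.
k2_grid = [0.03 + i * 0.0001 for i in range(0, 301)]

best_k2 = min(k2_grid, key=lambda k2: max_error(k2, xs))
print(f"best k2 on the grid: {best_k2:.4f}")
print(f"max error at best k2: {max_error(best_k2, xs):.6f}")
```

Under these assumptions the grid search lands near the 0.044715 coefficient; a different error criterion or a free $k_1$ would give somewhat different values.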

https://lijianxiong.work/2025/20250216/

Author: LJX

Published on February 16, 2025