Part V: Efficient Natural-gradient Methods for Exponential Family

18 minute read

Warning: work in progress (incomplete)

Goal (edited: 01-Jan-23)

This blog post should show that we can efficiently implement natural-gradient methods in many cases.

We will give an informal introduction with a focus on high-level ideas.

Click to see how to cite this blog post
@misc{lin2021NGDblog05,
  title = {Introduction to Natural-gradient Descent: Part V},
  author = {Lin, Wu and Nielsen, Frank and Khan, Mohammad Emtiyaz and Schmidt, Mark},
  url = {https://informationgeometryml.github.io/year-archive/}, 
  howpublished = {\url{https://informationgeometryml.github.io/posts/2021/12/Geomopt05/}},
  year = {2021},
  note = {Accessed: 2021-12-14}
}

Exponential Family


An exponential family takes the following (canonical) form: $$ \begin{aligned} p(\mathbf{w}|\mathbf{\eta}) = h_\eta(\mathbf{w}) \exp( \langle \mathbf{\eta} , \mathbf{T}_\eta (\mathbf{w}) \rangle - A_\eta (\mathbf{\eta}) ) \end{aligned} $$ where $C_\eta(\eta) := \int h_\eta(\mathbf{w}) \exp( \langle \mathbf{\eta} , \mathbf{T}_\eta(\mathbf{w}) \rangle ) d \mathbf{w} $ is the normalization constant.

|   | name |
| --- | --- |
| $A_\eta(\mathbf{\eta}):=\log C_\eta(\eta)$ | log-partition function |
| $h_\eta(\mathbf{w})$ | base measure |
| $\mathbf{T}_\eta(\mathbf{w})$ | sufficient statistics |
| $\eta$ | natural parameter |

The parameter space of $\eta$ denoted by $\Omega_\eta$ is determined so that the normalization constant is well-defined and (strictly and finitely) positive.

Regular natural parametrization $\eta$: parameter space $\Omega_\eta$ is open.

In this post, we only consider regular natural parametrizations since commonly used exponential family distributions have a regular natural parametrization. This natural parametrization is special since the inner product $\langle \mathbf{\eta} , \mathbf{T}_\eta(\mathbf{w}) \rangle$ is linear in $\eta$. As we will discuss later, this linearity is essential.

Examples of exponential families include the Gaussian, Bernoulli, von Mises–Fisher, and many more.

Readers should be aware of the following points when using an exponential family.

  • The support of $\mathbf{w}$ should not depend on parametrization $\eta$.

  • The base measure and the log-partition function are only unique up to a constant as illustrated by the following example.
    Univariate Gaussian as an exponential family (click to expand)

    Recall that in Part I, we consider this family as $ \{ \mathcal{N}(w |\mu,\sigma) \Big| \mu \in \mathcal{R}, \sigma>0 \}$ with mean $\mu$ and variance $\sigma$, where $\mathcal{N}(w |\mu,\sigma) = \frac{1}{\sqrt{2\pi \sigma} } \exp [- \frac{(w-\mu)^2}{2\sigma} ] $.

    We re-express it in an exponential form as

    $$ \begin{aligned} p({w}|\mathbf{\eta}) &= \frac{1}{\sqrt{2\pi \sigma} } \exp [- \frac{(w-\mu)^2}{2\sigma} ] \\ &= \underbrace{ \exp(0) }_{ h_\eta({w}) } \exp( \langle \underbrace{\begin{bmatrix} -\frac{1}{2\sigma} \\ \frac{\mu}{\sigma} \end{bmatrix}}_{\mathbf{\eta} } , \underbrace{\begin{bmatrix} w^2 \\ w \end{bmatrix}}_{ \mathbf{T}_\eta ({w}) } \rangle - \frac{1}{2} [ \log ( 2\pi ) + \log \sigma + \frac{\mu^2}{\sigma} ] ) \\ \end{aligned} $$ Since $\sigma= -\frac{1}{2\eta_1} $ and $\mu = -\frac{\eta_2}{2\eta_1}$, $A_\eta(\mathbf{\eta}) = \frac{1} {2} [ \log ( 2\pi ) + \log \sigma + \frac{\mu^2}{\sigma} ] = \frac{1}{2} [ \log ( 2\pi ) + \log (-\frac{1}{2\eta_1})-\frac{\eta_2^2}{2\eta_1} ] $.

    It is also valid that $ h_\eta({w}) = \frac{1}{\sqrt{2\pi}} $ and $A_\eta(\mathbf{\eta}) = \frac{1}{2} [ \log (-\frac{1}{2\eta_1})-\frac{\eta_2^2}{2\eta_1} ] $.

    We can easily verify that parameter space $\Omega_\eta= \{ (\eta_1,\eta_2) | \eta_1<0 , \eta_2 \in \mathcal{R} \}$ is open (in $\mathcal{R}^2$).

  • The log-partition function is differentiable w.r.t. $\eta$ [1] even when $\mathbf{w}$ is discrete.
    Bernoulli as an exponential family (click to expand)

    Recall that in Part I, we consider this family as $ \{ \mathcal{I}(w=0) \pi + \mathcal{I}(w=1) (1-\pi) \Big| 0<\pi<1 \}$

    We re-express it in an exponential form as

    $$ \begin{aligned} p({w}|\mathbf{\eta}) &= \mathcal{I}(w=0) \pi + \mathcal{I}(w=1) (1-\pi) \\ &=\underbrace{ \exp(0) }_{ h_\eta({w}) } \exp( \langle \underbrace{ \log \frac{\pi}{1-\pi}}_{\eta} , \underbrace{ \mathcal{I}(w=0)}_{T_\eta(w) } \rangle - \log \frac{1}{1-\pi} ) \end{aligned} $$ Since $\pi = \frac{\exp(\eta)}{1+ \exp(\eta) } $ , we have $A_\eta(\mathbf{\eta}) = \log \frac{1}{1-\pi} = \log(1+\exp(\eta))$.
    We can easily verify that the parameter space $\Omega_\eta= \{ \eta | \eta \in \mathcal{R} \}$ is open in $\mathcal{R}$. (A small numerical sketch of the Gaussian and Bernoulli natural parametrizations is given after this list.)

  • An invertible linear reparametrization could also be a natural parametrization. Thus, in this setting, natural-gradient descent could be better than (Euclidean) gradient descent since natural-gradient descent is linearly invariant.
    Natural parametrization is not unique (click to expand)

    For simplicity, let’s assume set $\Omega_\lambda := \{ \mathbf{U}^{-1} \eta | \eta \in \Omega_\eta \} = \Omega_\eta$, where $\mathbf{U}$ is a constant invertible matrix.

    Consider a linear reparametrization $\lambda=\mathbf{U}^{-1} \mathbf{\eta}$. Parametrization $\lambda$ is also a natural parametrization since $$ \begin{aligned} p(\mathbf{w}|\mathbf{\lambda}) &= h_\eta(\mathbf{w}) \exp( \langle \mathbf{U}\mathbf{\lambda} , \mathbf{T}_\eta(\mathbf{w}) \rangle - A_\eta( \mathbf{U}\mathbf{\lambda}) ) \\ &= h_\eta(\mathbf{w})\exp( \langle \mathbf{\lambda} , \mathbf{U}^T \mathbf{T}_\eta(\mathbf{w}) \rangle - A_\eta(\mathbf{U}\mathbf{\lambda}) ) \\ &= h_\lambda(\mathbf{w}) \exp( \langle \mathbf{\lambda} , \mathbf{T}_\lambda(\mathbf{w}) \rangle - A_\lambda(\mathbf{\lambda}) ) \end{aligned} $$ where $h_\lambda(\mathbf{w}):= h_\eta(\mathbf{w})$, $\mathbf{T}_\lambda(\mathbf{w}):= \mathbf{U}^T\mathbf{T}_\eta(\mathbf{w})$, and $A_\lambda(\mathbf{\lambda}):= A_\eta(\mathbf{U}\mathbf{\lambda})$.
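Before moving on, here is a small numerical sketch (in Python/NumPy; not part of the original derivations, and the function names are illustrative) of the univariate Gaussian and Bernoulli re-expressions above. It checks that the stated log-partition functions indeed normalize the densities.

```python
import numpy as np
from scipy.integrate import quad

# --- Univariate Gaussian: eta = (-1/(2*sigma), mu/sigma), T(w) = (w^2, w) ---
def gauss_natural(mu, sigma):               # sigma denotes the variance, as above
    return np.array([-0.5 / sigma, mu / sigma])

def gauss_log_partition(eta):
    eta1, eta2 = eta
    return 0.5 * (np.log(2 * np.pi) + np.log(-1.0 / (2 * eta1)) - eta2**2 / (2 * eta1))

eta = gauss_natural(mu=1.3, sigma=0.7)
# Normalization constant C(eta) = integral of exp(<eta, T(w)>) dw  (base measure h(w) = 1)
C, _ = quad(lambda w: np.exp(eta[0] * w**2 + eta[1] * w), -np.inf, np.inf)
print(np.log(C), gauss_log_partition(eta))  # the two numbers should agree

# --- Bernoulli: eta = log(pi / (1 - pi)), T(w) = I(w = 0) ---
def bern_natural(pi):
    return np.log(pi / (1.0 - pi))

def bern_log_partition(eta):
    return np.log1p(np.exp(eta))            # A(eta) = log(1 + exp(eta))

eta_b = bern_natural(pi=0.3)
p0 = np.exp(eta_b - bern_log_partition(eta_b))   # p(w = 0)
p1 = np.exp(-bern_log_partition(eta_b))          # p(w = 1)
print(p0, p1, p0 + p1)                           # ~0.3, ~0.7, 1.0
```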

Minimal Parametrizations of Exponential Family

Now, we discuss a particular class of parametrizations of an exponential family. We can efficiently compute natural-gradients when using this class of parametrizations.

Minimal natural parametrization: the entries of the corresponding sufficient statistics $\mathbf{T}_\eta(\mathbf{w})$ are linearly independent.

A regular, minimal, and natural parametrization $\eta$ has many nice properties [1] [2].

  • It is an intrinsic parametrization as we discussed in Part I.

  • The parameter space $\Omega_\eta$ is an open set in $\mathcal{R}^K$, where $K$ is the number of entries of this parameter array.

  • The log-partition function $A_\eta(\eta)$ is infinitely differentiable and strictly convex in $\Omega_\eta$.

  • The FIM $\mathbf{F}_\eta(\eta) = \nabla_\eta^2 A_\eta(\eta)$ is positive-definite in its domain.

We will only show the first property in this post. The remaining properties can be found in the literature. Note that the linearity of the inner product $\langle \mathbf{\eta} , \mathbf{T}_\eta(\mathbf{w}) \rangle$ in $\eta$ plays a key role in showing these properties.

Claim:

A regular, minimal, and natural parametrization is intrinsic.

Proof of the claim (click to expand)

Proof by contradiction:

Recall that a parametrization is intrinsic if the $K$ partial derivatives $ \{ \partial_{\eta_i} \log p(\mathbf{w}|\eta) \} $ are linearly independent. Since the parameter space is open in $\mathcal{R}^K$ and the log-partition function $A_\eta(\eta)$ is differentiable, these partial derivatives are well-defined and can be computed as $$ \begin{aligned} \partial_{\eta_i} \log p(\mathbf{w}|\eta) = \langle \mathbf{e}_i, \mathbf{T}_\eta (\mathbf{w}) \rangle - \partial_{\eta_i}A_\eta(\eta) \end{aligned} $$ where $\mathbf{e}_i$ is a one-hot/unit array which has zero in all entries except the $i$-th entry.

Suppose a regular, minimal, and natural parametrization is not intrinsic. Then these partial derivatives must be linearly dependent. Thus, there exists a set of constants $c_i$, not all zero, such that $\sum_i c_i \partial_{\eta_i} \log p(\mathbf{w}|\eta)= 0 $ for all $\mathbf{w}$, where the values of $c_i$ do not depend on $\mathbf{w}$. $$ \begin{aligned} 0 &= \sum_{i=1}^{K} c_i \partial_{\eta_i} \log p(\mathbf{w}|\eta) \\ &= \sum_{i=1}^{K} c_i \langle \mathbf{e}_i, \mathbf{T}_\eta (\mathbf{w}) \rangle - c_i\partial_{\eta_i}A_\eta(\eta) \\ \end{aligned} $$

Since each $c_i$ is a constant whose value does not depend on $\mathbf{w}$, the term $\sum_{i=1}^{K} c_i \partial_{\eta_i}A_\eta(\eta)$ is also constant in $\mathbf{w}$, and thus we must have $$ \begin{aligned} \sum_{i=1}^{K} c_i \langle \mathbf{e}_i, \mathbf{T}_\eta (\mathbf{w}) \rangle = \sum_{i=1}^{K} c_i \partial_{\eta_i}A_\eta(\eta) \qquad \text{for all } \mathbf{w}, \end{aligned} $$ which implies that the entries of the sufficient statistics $\mathbf{T}_\eta(\mathbf{w})$ are linearly dependent (a non-trivial linear combination of them is constant in $\mathbf{w}$). This is a contradiction since $\eta$ is a minimal natural parametrization.

Now, we give an example of a regular, minimal and natural parametrization.

Minimal parametrization for multivariate Gaussian (click to expand)

Consider a $d$-dimensional Gaussian family.

We specify a parameterization $\mathbf{\tau}$ of the family as $ \{ \mathcal{N}(\mathbf{w} |\mathbf{\mu},\mathbf{\Sigma}) \Big| \mathbf{\mu} \in \mathcal{R}^d, \mathbf{\Sigma} \succ \mathbf{0} \}$ with mean $\mathbf{\mu}$ and covariance $\mathbf{\Sigma}$.

We re-express it in an exponential form as

$$ \begin{aligned} p({w}|\mathbf{\lambda}) &= \frac{1}{\sqrt{ \mathrm{det}( 2\pi \Sigma )} } \exp [- \frac{1}{2} (\mathbf{w}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{w}-\mathbf{\mu}) ] \\ &= \underbrace{ \exp(0) }_{ h_\lambda({w}) } \exp( \langle \underbrace{\begin{bmatrix} -\frac{1}{2} \mathrm{vec}( \Sigma^{-1} ) \\ \Sigma^{-1}\mu \end{bmatrix}}_{\mathbf{\lambda} } , \underbrace{\begin{bmatrix} \mathrm{vec}( \mathbf{w} \mathbf{w}^T) \\ \mathbf{w} \end{bmatrix}}_{ \mathbf{T}_\lambda ({w}) } \rangle - \frac{1}{2} [ d\log ( 2\pi ) + \log \mathrm{det} (\Sigma) + \mu^T \Sigma^{-1} \mu ] ) \\ \end{aligned} $$ where $\mathrm{vec}()$ is the vectorization function and $\lambda$ is a $(d+d^2)$-dim array.

Parametrization $\lambda$ is a natural and regular parametrization. However, it is NOT a minimal parametrization since $\mathbf{w} \mathbf{w}^T$ is symmetric and therefore the entries of the sufficient statistics are linearly dependent. It can be shown that $\Omega_\lambda$ is relatively open but not open in $\mathcal{R}^K$, where $K=d+d^2$.

Recall that $\mathrm{vech}()$ is the half-vectorization function. We define a new map $\mathrm{vec2h}(\mathrm{S}):=\mathrm{vech}\big(2\mathrm{S} - \mathrm{Diag}( \mathrm{diag}(\mathrm{S}) ) \big)$, which is like map $\mathrm{vech}()$ except that the off-diagonal entries in the lower-triangular part are multiplied by 2.

A minimal natural parametrization $\eta$ can be defined as $$ \begin{aligned} p({w}|\mathbf{\eta}) &= \underbrace{ \exp(0) }_{ h_\eta({w}) } \exp( \langle \underbrace{\begin{bmatrix} -\frac{1}{2} {\color{red}\mathrm{vech}}( \Sigma^{-1} ) \\ \Sigma^{-1}\mu \end{bmatrix}}_{\mathbf{\eta} } , \underbrace{\begin{bmatrix} {\color{red}\mathrm{vec2h} }( \mathbf{w}\mathbf{w}^T ) \\ \mathbf{w} \end{bmatrix}}_{ \mathbf{T}_\eta ({w}) } \rangle - \frac{1}{2} [ d\log ( 2\pi ) + \log \mathrm{det} (\Sigma) + \mu^T \Sigma^{-1} \mu ] ) \\ \end{aligned} $$ where $\eta$ is a $(d+\frac{d(d+1)}{2})$-dim array. It can be shown that $\Omega_\eta$ is open in $\mathcal{R}^K$, where $K=d+\frac{d(d+1)}{2}$.

As we discussed in Part II, $\mathrm{vech}(\Sigma^{-1})$ is an intrinsic parameterization while $\mathrm{vec}(\Sigma^{-1})$ is not.
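The following sketch (Python/NumPy, with illustrative helper names) shows one way the maps $\mathrm{vech}$ and $\mathrm{vec2h}$ and the minimal natural parameter of a multivariate Gaussian could be assembled; the ordering of the half-vectorized entries is a convention and only needs to be used consistently.

```python
import numpy as np

def vech(S):
    """Stack the lower-triangular entries of S (including the diagonal) into a vector."""
    return S[np.tril_indices(S.shape[0])]

def vec2h(S):
    """vech(2*S - Diag(diag(S))): off-diagonal lower-triangular entries are doubled."""
    return vech(2.0 * S - np.diag(np.diag(S)))

def gaussian_minimal_natural(mu, Sigma):
    """Minimal natural parameter eta of N(mu, Sigma), following the expression above."""
    P = np.linalg.inv(Sigma)                     # precision matrix Sigma^{-1}
    return np.concatenate([-0.5 * vech(P), P @ mu])

d = 3
B = np.random.randn(d, d)
Sigma = B @ B.T + np.eye(d)                      # a random SPD covariance
mu = np.random.randn(d)
eta = gaussian_minimal_natural(mu, Sigma)
print(eta.shape)                                 # (d*(d+1)/2 + d,) = (9,)
```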

The following example illustrates a non-minimal natural parametrization.

Non-minimal parametrization for Bernoulli (click to expand)

We consider the family from the previous section with another parametrization discussed in Part I: $ \{ \mathcal{I}(w=0) \frac{\pi_1}{\pi_1+\pi_2} + \mathcal{I}(w=1) \frac{\pi_2}{\pi_1+\pi_2} \Big| \pi_1>0, \pi_2>0 \}$

We re-express it in another exponential form as

$$ \begin{aligned} p({w}|\mathbf{\eta}) &= \mathcal{I}(w=0) \frac{\pi_1}{\pi_1+\pi_2} + \mathcal{I}(w=1) \frac{\pi_2}{\pi_1+\pi_2} \\ &=\underbrace{ \exp(0) }_{ h_\eta({w}) } \exp( \langle \underbrace{ \begin{bmatrix} \log \frac{\pi_1}{\pi_1+\pi_2} \\ \log \frac{\pi_2}{\pi_1+\pi_2} \end{bmatrix} }_{\eta} , \underbrace{ \begin{bmatrix} \mathcal{I}(w=0) \\ \mathcal{I}(w=1)\end{bmatrix} }_{\mathbf{T}_\eta(w) } \rangle -\underbrace{ 0}_{A_\eta(\eta)} ) \end{aligned} $$

The natural parameter space $\Omega_\eta= \{ (\eta_1,\eta_2) | \exp(\eta_1)+\exp(\eta_2) = 1 \}$ is not open in $\mathcal{R}^2$.

This is not a minimal natural parametrization since the entries of the sufficient statistics $\mathbf{T}_\eta(w)$ are linearly dependent as $\mathcal{I}(w=0)+\mathcal{I}(w=1)=1$.

Efficient Natural-gradient Computation

In general, natural-gradient computation can be challenging due to the inverse of the Fisher matrix. In the case of an exponential family, natural-gradient computation can often be quite efficient without directly computing the inverse of the Fisher matrix. We will assume $\eta$ is a regular, minimal, and natural parametrization.

Expectation Parametrization

We introduce a dual parametrization $\mathbf{m} := E_{p(w|\eta)}[ \mathbf{T}_\eta(\mathbf{w}) ] $, which is known as the expectation parametrization. This new parametrization plays a key role for the efficient natural-gradient computation.

Recall that in Part IV, we used the identity that the expectation of the score function is zero. We can establish a connection between these two parametrizations via this identity. $$ \begin{aligned} \mathbf{0} & = E_{p(w|\eta)} [ \nabla_\eta \log p(\mathbf{w}|\eta) ] &\,\,\,( \text{expectation of the score is zero} ) \\ &=E_{p(w|\eta)} [ \mathbf{T}_\eta(\mathbf{w}) - \nabla_\eta A_\eta(\eta) ] \\ &= \mathbf{m} - \nabla_\eta A_\eta(\eta) \end{aligned}\tag{1}\label{1} $$ which is a valid Legendre (dual) transformation1 since $\nabla_\eta^2 A_\eta(\eta)$ is positive-definite in its domain $\Omega_\eta$.
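As a quick numerical check of $\eqref{1}$, the sketch below (using JAX auto-differentiation; an assumption for illustration, not part of the post) differentiates the univariate Gaussian log-partition function and compares $\nabla_\eta A_\eta(\eta)$ with the moments $(E[w^2], E[w])$.

```python
import jax.numpy as jnp
from jax import grad

def log_partition(eta):
    # A_eta(eta) for the univariate Gaussian natural parametrization above
    eta1, eta2 = eta
    return 0.5 * (jnp.log(2 * jnp.pi) + jnp.log(-1.0 / (2 * eta1)) - eta2**2 / (2 * eta1))

mu, sigma = 1.3, 0.7                          # mean and variance
eta = jnp.array([-0.5 / sigma, mu / sigma])   # natural parameter

m = grad(log_partition)(eta)                  # expectation parameter m = grad A(eta), eq. (1)
print(m)                                      # should be (E[w^2], E[w]) = (mu^2 + sigma, mu)
print(mu**2 + sigma, mu)
```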

Expectation Parameter Space

We can view the expectation parameter as the output of a continuous map $$ \begin{aligned} \mathbf{m}(\eta):=\nabla_\eta A_\eta(\eta), \end{aligned}\tag{2}\label{2} $$ where the input space of this map is $\Omega_\eta$.

We define the expectation parameter space as the output space of the map $$ \begin{aligned} \Omega_m :=\{\mathbf{m}(\eta) | \eta \in \Omega_\eta \}. \end{aligned} $$

Since $\nabla_\eta^2 A_\eta(\eta)$ is positive-definite in the open set $\Omega_\eta$, we can show that there exists a one-to-one relationship between the natural parameter $\eta$ and the expectation parameter $\mathbf{m}$, which implies that the map $\mathbf{m}(\cdot)$ is injective.

Since $\Omega_\eta$ is open in $\mathcal{R}^K$, we can show that the expectation parameter space $\Omega_m$ is also open in $\mathcal{R}^K$ due to the invariance of domain.

Natural-gradient Computation

Note that the FIM of the exponential family under parametrization $\eta$ is $$ \begin{aligned} \mathbf{F}_\eta(\eta)=\nabla_\eta^2 A_\eta(\eta) = \nabla_\eta \mathbf{m} \end{aligned}\tag{3}\label{3} $$ which means this FIM is a Jacobian matrix $\mathbf{J}=\nabla_\eta \mathbf{m}(\eta)$.

As we discussed in Part II, a natural-gradient w.r.t. $f(\eta)$ can be computed as below, where $\mathbf{g}_\eta=\nabla_\eta f(\eta)$ is a Euclidean gradient w.r.t. natural parameter $\eta$. $$ \begin{aligned} \hat{\mathbf{g}}_\eta & = \mathbf{F}^{-1}_\eta(\eta) \mathbf{g}_\eta \\ &= (\nabla_\eta \mathbf{m} )^{-1} [ \nabla_\eta f(\eta) ] \\ &= [\nabla_{m} \eta ] [ \nabla_\eta f(\eta) ] \\ &= \nabla_{m} f(\eta) \end{aligned}\tag{4}\label{4} $$ where $\nabla_{m} f( \eta )$ is a Euclidean gradient w.r.t. expectation parameter $\mathbf{m}$ and $\eta=\eta( \mathbf{m} )$ can be viewed as a function of $\mathbf{m}$.

Therefore, we can efficiently compute natural-gradients w.r.t. natural parameter $\eta$ [3] if we can easily compute Euclidean gradients w.r.t. its expectation parameter $\mathbf{m}$.
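The identity $\eqref{4}$ can be verified numerically. The sketch below (JAX, with an arbitrary illustrative objective $f$ chosen only for the demonstration) compares $\mathbf{F}^{-1}_\eta(\eta)\nabla_\eta f(\eta)$ with the Euclidean gradient w.r.t. $\mathbf{m}$ for the univariate Gaussian.

```python
import jax.numpy as jnp
from jax import grad, hessian

def log_partition(eta):                        # univariate Gaussian A(eta), as before
    eta1, eta2 = eta
    return 0.5 * (jnp.log(2 * jnp.pi) + jnp.log(-1.0 / (2 * eta1)) - eta2**2 / (2 * eta1))

m_of_eta = grad(log_partition)                 # m(eta) = grad A(eta)

def f_m(m):                                    # an arbitrary illustrative objective in m
    return jnp.sum(m**2) + jnp.sin(m[0] * m[1])

def f_eta(eta):                                # the same objective viewed as a function of eta
    return f_m(m_of_eta(eta))

eta = jnp.array([-0.8, 1.1])                   # any point with eta1 < 0

F = hessian(log_partition)(eta)                # FIM = Hessian of the log-partition, eq. (3)
nat_grad = jnp.linalg.solve(F, grad(f_eta)(eta))   # F^{-1} grad_eta f, eq. (4)
print(nat_grad)
print(grad(f_m)(m_of_eta(eta)))                # should match: Euclidean gradient w.r.t. m
```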

Efficient NGD for multivariate Gaussians

Given a $d$-dim multivariate Gaussian with a full covariance structure, the naive way to compute natural-gradients in this case has $O( (d^2)^3 )=O(d^6)$ iteration cost since the covariance matrix has $d^2$ entries. Now, we show how to efficiently compute natural-gradients in this case.

Claim:

We can efficiently compute natural-gradients with $O( d^3 )$ iteration cost in this case, where the number of parameters is $O(d^2)$.

Proof of the claim (Click to expand)

As shown in the previous section, a minimal, regular, and natural parameter is $\eta=(\mathbf{S}\mu, -\frac{1}{2}\mathrm{vech}(\mathbf{S}) )$, where $\Sigma$ is the covariance and $\mathbf{S}= \Sigma^{-1}$ is the precision matrix. $\Sigma$ is considered as a $d$-by-$d$ matrix with $\frac{d(d+1)}{2}$ distinct entries (degrees of freedom).

The corresponding expectation parameter is $\mathbf{m}=( E_{p(w)}[ \mathbf{w} ], E_{p(w)}[ \mathrm{vec2h}( \mathbf{w}\mathbf{w}^T ) ]) = (\mu, \mathrm{vec2h}(\mu\mu^T+\Sigma ) ) $, where $\mathrm{vec2h}(\mathrm{S}):=\mathrm{vech}\big(2\mathrm{S} - \mathrm{Diag}( \mathrm{diag}(\mathrm{S}) ) \big)$ is defined in the previous section.

By using the result in this section, we can efficiently compute each natural gradient w.r.t. $\eta$ by computing the corresponding Euclidean gradient w.r.t. $\mathbf{m}$.

Denote $\mathbf{m}^{(1)}:=\mu$ and $\mathbf{m}^{(2)}:=\mathrm{vec2h}(\mu\mu^T+\Sigma)$.

Notice that we can re-express the mean and the covariance in terms of $\mathbf{m}=(\mathbf{m}^{(1)}, \mathbf{m}^{(2)})$ as $$ \begin{aligned} \mu = \mathbf{m}^{(1)} , \mathrm{vec2h}(\Sigma) = \mathbf{m}^{(2)} -\mathrm{vec2h}( \mathbf{m}^{(1)}(\mathbf{m}^{(1)})^T) \end{aligned} $$

Given a smooth scalar function denoted by $\ell$, by the chain rule, we have $$ \begin{aligned} \overbrace{ \frac{\partial \ell}{\partial m^{(1)}_i} }^{\text{scalar}}& = \overbrace{ \frac{\partial \ell}{\partial \mu}}^{\text{row vector } } \overbrace{ \frac{\partial \mu }{\partial m^{(1)}_i}}^{\text{column vector}} + \frac{\partial \ell}{\partial \mathrm{vec2h}(\Sigma)} \frac{\partial \mathrm{vec2h}(\Sigma) }{\partial m^{(1)}_i} \\ \frac{\partial \ell}{\partial m^{(2)}_i} & = \frac{\partial \ell}{\partial \mu} \frac{\partial \mu }{\partial m^{(2)}_i} + \frac{\partial \ell}{\partial \mathrm{vec2h}(\Sigma)} \frac{\partial \mathrm{vec2h}(\Sigma) }{\partial m^{(2)}_i} \end{aligned} $$ where $m^{(1)}_i$ denotes the $i$-th entry of $\mathbf{m}^{(1)}$.

First, notice that we can efficiently compute the following Euclidean gradients by Auto-Diff $$ \begin{aligned} \frac{\partial \ell}{\partial \mu} &= \mathbf{g}_\mu^T \\ \frac{\partial \ell}{\partial V } & =\mathbf{g}_{V} \,\,\,\, ( \mathbf{V} \text{ is a d-by-d matrix with } d^2 \text{ degrees of freedom} ) \\ \frac{\partial \ell}{\partial \Sigma } & =\mathbf{g}_{\Sigma} \,\,\,\, ( \Sigma \text{ is a d-by-d matrix with } \frac{d(d+1)}{2} \text{ degrees of freedom} ) \\ \frac{\partial \ell}{\partial \mathrm{vec2h}(\Sigma)} &= ( \mathrm{vech}(\mathbf{g}_{V}))^T \end{aligned} $$ where $\mathbf{V}$ takes the same value as $\Sigma$.

Note:

$$ \begin{aligned} \mathrm{vech}(\mathbf{g}_\Sigma) = \mathrm{vec2h}(\mathbf{g}_V) \end{aligned} $$

We can verify the following identity for $\mathbf{m}^{(1)}$. $$ \begin{aligned} \frac{\partial \ell}{\partial \mathrm{vec2h}(\Sigma)} \frac{\partial \mathrm{vec2h}(\Sigma) }{\partial m^{(1)}_i} &= \mathrm{Trace}\big(\frac{\partial \ell}{\partial \mathbf{V}} \frac{\partial \mathbf{V} }{\partial m^{(1)}_i} \big) \end{aligned} $$ where $\mathbf{V}= \mathbf{m}^{(2)} - \mathbf{m}^{(1)}(\mathbf{m}^{(1)})^T$.

By the identity, we can easily compute the Euclidean gradients w.r.t. the expectation parameter $\mathbf{m}= (\mathbf{m}^{(1)}, \mathbf{m}^{(2)})$ as

$$ \begin{aligned} \frac{\partial \ell}{\partial m^{(1)}} &= \mathbf{g}_\mu^T - 2 (\mathbf{m}^{(1)})^T \mathbf{g}_{V}^T = \mathbf{g}_\mu^T - 2 \mu^T \mathbf{g}_{V}^T \\ \frac{\partial \ell}{\partial m^{(2)}} &= \mathbf{0} + ( \mathrm{vech}(\mathbf{g}_{V}))^T = ( \mathrm{vech}(\mathbf{g}_{V}))^T \end{aligned} $$
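Putting the pieces together, a minimal sketch of the resulting natural-gradient computation is given below (Python/NumPy; the helper names are hypothetical). Given the auto-diff gradients $\mathbf{g}_\mu$ and $\mathbf{g}_V$, the natural gradients w.r.t. $\eta=(\mathbf{S}\mu, -\frac{1}{2}\mathrm{vech}(\mathbf{S}))$ are read off as the Euclidean gradients w.r.t. $\mathbf{m}$, so no $O(d^6)$ Fisher-matrix inverse is ever formed.

```python
import numpy as np

def vech(S):
    return S[np.tril_indices(S.shape[0])]

def gaussian_natural_gradient(mu, g_mu, g_V):
    """
    Natural gradients w.r.t. eta = (S mu, -1/2 vech(S)), obtained as the Euclidean
    gradients w.r.t. the expectation parameter m (see eq. 4 and the proof above).
    g_mu : gradient of the objective w.r.t. mu        -- shape (d,)
    g_V  : symmetric gradient w.r.t. the covariance   -- shape (d, d)
    No Fisher matrix is formed; the cost is O(d^2) once g_mu and g_V are available.
    """
    nat_grad_mean_part = g_mu - 2.0 * g_V @ mu     # d ell / d m^(1)
    nat_grad_cov_part = vech(g_V)                  # d ell / d m^(2)
    return nat_grad_mean_part, nat_grad_cov_part

# Illustrative usage with a random symmetric g_V
d = 4
mu, g_mu = np.random.randn(d), np.random.randn(d)
B = np.random.randn(d, d)
g_V = 0.5 * (B + B.T)
ng1, ng2 = gaussian_natural_gradient(mu, g_mu, g_V)
print(ng1.shape, ng2.shape)                        # (4,), (10,)
```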

Natural-gradient Descent as Unconstrained Mirror Descent

We assume natural parametrization $\eta$ is both regular and minimal. In exponential family cases, we will show that natural-gradient descent can be viewed as a mirror descent update when the natural parameter space $\Omega_\eta$ is unconstrained.

Since mirror descent is defined by using a Bregman divergence, we first introduce the Bregman divergence.

Bregman Divergence

Given a strictly convex function $\Phi(\cdot)$ in its domain, a Bregman divergence equipped with $\Phi(\cdot)$ is defined as $$ \begin{aligned} \mathrm{B}_\Phi(\mathbf{x},\mathbf{y}):= \Phi(\mathbf{x})- \Phi(\mathbf{y}) - \langle \nabla \Phi(\mathbf{y}), (\mathbf{x}-\mathbf{y}) \rangle \end{aligned} $$

In particular, the Kullback–Leibler (KL) divergence is a Bregman divergence under natural parametrization $\eta$: $$ \begin{aligned} \, & \,\mathrm{KL} [p(\mathbf{w}| \eta_1 ) || p(\mathbf{w}|{\color{red}\eta_2})]\\ =& \,E_{p(w|\eta_1)} [ \log \frac{p(\mathbf{w}|\eta_1)} {p(\mathbf{w}|\eta_2)} ] \\ =& \,E_{p(w|\eta_1)} [ \langle \eta_1-\eta_2 , \mathbf{T}_\eta (\mathbf{w}) \rangle - A_\eta(\eta_1) + A_\eta(\eta_2) ] & ( p(\mathbf{w}|\eta) \text{ is an exponential family}) \\ =& \,A_\eta(\eta_2) - A_\eta(\eta_1) - E_{p(w|\eta_1)} [ \langle \eta_2-\eta_1, \mathbf{T}_\eta (\mathbf{w}) \rangle ] \\ =& \,A_\eta(\eta_2) - A_\eta(\eta_1) - \langle \eta_2-\eta_1, \underbrace{ E_{p(w|\eta_1)} [ \mathbf{T}_\eta (\mathbf{w}) ] }_{ \nabla_\eta A_\eta(\eta_1) } \rangle \\ =& \, \mathrm{B}_{A_\eta}( {\color{red} \mathbf{\eta}_2}, \mathbf{\eta}_1 ) & ( A_\eta(\eta) \text{ is strictly convex}) \end{aligned} $$
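A small numerical check of this identity for the univariate Gaussian family (a sketch; the closed-form KL expression below uses the variance convention of Part I):

```python
import numpy as np

def A(eta):                                    # univariate Gaussian log-partition (see above)
    eta1, eta2 = eta
    return 0.5 * (np.log(2 * np.pi) + np.log(-1.0 / (2 * eta1)) - eta2**2 / (2 * eta1))

def grad_A(eta):                               # m = grad A(eta) = (E[w^2], E[w])
    eta1, eta2 = eta
    sigma, mu = -1.0 / (2 * eta1), -eta2 / (2 * eta1)
    return np.array([mu**2 + sigma, mu])

def bregman_A(x, y):                           # B_A(x, y) = A(x) - A(y) - <grad A(y), x - y>
    return A(x) - A(y) - grad_A(y) @ (x - y)

def kl_gauss(mu1, s1, mu2, s2):                # KL[N(mu1, s1) || N(mu2, s2)], s = variance
    return 0.5 * (np.log(s2 / s1) + (s1 + (mu1 - mu2) ** 2) / s2 - 1.0)

to_eta = lambda mu, s: np.array([-0.5 / s, mu / s])
mu1, s1, mu2, s2 = 0.3, 0.5, -1.0, 2.0
print(kl_gauss(mu1, s1, mu2, s2))
print(bregman_A(to_eta(mu2, s2), to_eta(mu1, s1)))   # should match the KL above
```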

We denote the expectation parameter as $\mathbf{m}$. Recall that by the Legendre transformation, we have $\mathbf{m}=\nabla_\eta A_\eta(\eta)$, where $\Omega_m$ has been defined here. Note that we assume natural parameter $\eta$ is minimal. In other words, $\nabla_\eta^2 A_\eta(\eta)$ is positive-definite and $A_\eta(\eta)$ is strictly convex in $\Omega_\eta$.

We define $A_\eta(\mathbf{x}):=+\infty$ when $\mathbf{x} \not \in \Omega_\eta$. The convex conjugate of the log-partition function $A_\eta$ is defined as $$ \begin{aligned} A^*_\eta( \mathbf{m}) &:= \sup_{x} \{ \langle \mathbf{x},\mathbf{m} \rangle - A_\eta(\mathbf{x}) \} \\ &= \langle \mathbf{\eta},\mathbf{m} \rangle - A_\eta(\mathbf{\eta}) \,\,\,\, (\text{the supremum is attained at } \mathbf{x}=\eta \in \Omega_\eta )\\ \end{aligned}\tag{5}\label{5} $$ where the domain of $A^*_\eta(\mathbf{m})$ is $\Omega_m$, and $\eta=\eta( \mathbf{m} )$ should be viewed as a function of $\mathbf{m}$.

When $\mathbf{m} \in \Omega_m$, we have the following identity, which is indeed another Legendre transformation. $$ \begin{aligned} \nabla_{\mathbf{m}} A^*_\eta( \mathbf{m}) &= \mathbf{\eta} + \langle \nabla_{\mathbf{m}} \mathbf{\eta},\mathbf{m} \rangle - \nabla_{\mathbf{m}} A_\eta(\mathbf{\eta}) \\ &= \mathbf{\eta} + \langle \nabla_{\mathbf{m}} \mathbf{\eta},\mathbf{m} \rangle - [\nabla_{\mathbf{m}} \eta] \underbrace{ [\nabla_\eta A_\eta(\mathbf{\eta})] }_{ = \mathbf{m}}\\ &= \mathbf{\eta} , \end{aligned}\tag{6}\label{6} $$ where $\eta \in \Omega_\eta$ due to $\eqref{5}$.

The convex conjugate $A^*_\eta( \mathbf{m})$ is strictly convex w.r.t. $\mathbf{m}$ since the Hessian $\nabla_m^2 A^*_\eta( \mathbf{m})$ is positive-definite as shown below.

$$ \begin{aligned} \nabla_{\mathbf{m}}^2 A^*_\eta( \mathbf{m}) &= \nabla_{\mathbf{m}} \mathbf{\eta} \end{aligned} $$

Note that due to $\eqref{3}$, the FIM $\mathbf{F}_\eta(\eta)$ under natural parameter $\mathbf{\eta}$ is the Jacobian matrix $\mathbf{J}= \nabla_{\eta} \mathbf{m}$ and positive-definite.

It is then easy to see that $ \nabla_{\mathbf{m}}^2 A^*_\eta( \mathbf{m}) = \mathbf{F}^{-1}_\eta(\eta) $, which is positive-definite, so $A^*_\eta( \mathbf{m})$ is indeed strictly convex.

By the transformation rule of the FIM, we have the following relationship. $$ \begin{aligned} \mathbf{F}_{\eta} (\eta) & = \mathbf{J}^T \mathbf{F}_{m}(\mathbf{m}) \mathbf{J} \\ &= \mathbf{F}^T_{\eta} (\eta) \mathbf{F}_{m}(\mathbf{m})\mathbf{F}_{\eta} (\eta) \\ &= \mathbf{F}_{\eta} (\eta) \mathbf{F}_{m}(\mathbf{m})\mathbf{F}_{\eta} (\eta) & (\text{the FIM is symmetric}) \end{aligned} $$ which implies that the FIM under expectation parameter $\mathbf{m}$ is $\mathbf{F}_m(\mathbf{m})=\mathbf{F}^{-1}_\eta(\eta) = \nabla_{\mathbf{m}}^2 A^*_\eta( \mathbf{m})= \nabla_{\mathbf{m}} \mathbf{\eta}$.

Moreover, we have the following identity since by $\eqref{5}$, $A_\eta(\eta)=\langle \mathbf{\eta},\mathbf{m} \rangle- A^*_\eta( \mathbf{m}) $. $$ \begin{aligned} \mathrm{B}_{A_\eta}(\mathbf{\eta}_2, {\color{red}\mathbf{\eta}_1 }) &= A_\eta(\eta_2) - A_\eta(\eta_1) - \langle \eta_2-\eta_1, \overbrace{ \nabla_\eta A_\eta(\eta_1) }^{= \mathbf{m}_1} \rangle \\ &= [ \langle \mathbf{\eta}_2,\mathbf{m}_2 \rangle- A^*_\eta( \mathbf{m}_2) ] -[ \langle \mathbf{\eta}_1,\mathbf{m}_1 \rangle- A^*_\eta( \mathbf{m}_1) ] -\langle \eta_2-\eta_1, \mathbf{m}_1 \rangle \\ &= A^*_\eta( \mathbf{m}_1) - A^*_\eta( \mathbf{m}_2) - \langle \mathbf{m}_1-\mathbf{m}_2, \underbrace{ \eta_2}_{ = \nabla_{\mathbf{m}} A^*_\eta( \mathbf{m}_2)} \rangle\\ &= \mathrm{B}_{A^*_\eta}( {\color{red} \mathbf{m}_1 },\mathbf{m}_2) & (\text{the order is changed}) \end{aligned} $$
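This duality can also be checked numerically. The sketch below uses the Bernoulli family, where $A(\eta)=\log(1+\exp(\eta))$ and its convex conjugate is the negative binary entropy:

```python
import numpy as np

A       = lambda eta: np.log1p(np.exp(eta))                   # Bernoulli log-partition
dA      = lambda eta: np.exp(eta) / (1.0 + np.exp(eta))       # m = grad A(eta) = pi
A_star  = lambda m: m * np.log(m) + (1 - m) * np.log(1 - m)   # convex conjugate of A on (0, 1)
dA_star = lambda m: np.log(m / (1 - m))                       # eta = grad A*(m)

def bregman(F, dF, x, y):
    return F(x) - F(y) - dF(y) * (x - y)

eta1, eta2 = -0.8, 1.4
m1, m2 = dA(eta1), dA(eta2)
print(bregman(A, dA, eta2, eta1))          # B_A(eta2, eta1)
print(bregman(A_star, dA_star, m1, m2))    # B_{A*}(m1, m2): should match
```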

Mirror Descent

Now, we give the definition of mirror descent.

Consider the following optimization problem over a convex domain denoted by $\Omega_\theta$. $$ \begin{aligned} \min_{\theta \in \Omega_\theta} \ell_\theta(\mathbf{\theta}) \end{aligned} $$

Given a strictly convex function $\Phi(\mathbf{\theta})$ in the domain $\Omega_\theta$, mirror descent with step-size $\alpha$ is defined as

$$ \begin{aligned} \mathbf{\theta}_{k+1} \leftarrow \arg \min_{x \in \Omega_\theta}\{ \langle \nabla_\theta \ell_\theta(\mathbf{\theta}_k), \mathbf{x}-\mathbf{\theta}_k \rangle + \frac{1}{\alpha} \mathrm{B}_{\Phi}(\mathbf{x},\mathbf{\theta}_k) \} \end{aligned} $$ where $\mathrm{B}_\Phi(\cdot,\cdot)$ is a Bregman divergence equipped with the strictly convex function $\Phi(\cdot)$.
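When the arg-min lies in the interior of $\Omega_\theta$, setting the gradient of the objective above to zero gives the familiar "mirror map" form of the update, $\nabla\Phi(\mathbf{\theta}_{k+1}) = \nabla\Phi(\mathbf{\theta}_k) - \alpha\nabla_\theta \ell_\theta(\mathbf{\theta}_k)$. A generic sketch of one such step (Python; the function arguments are placeholders supplied by the user):

```python
import numpy as np

def mirror_descent_step(theta, grad_loss, grad_Phi, grad_Phi_conj, alpha):
    """
    One mirror-descent step when the arg-min lies in the interior of the domain:
        grad_Phi(theta_next) = grad_Phi(theta) - alpha * grad_loss(theta).
    grad_Phi_conj is the inverse mirror map, i.e. the gradient of the convex conjugate of Phi.
    """
    dual = grad_Phi(theta) - alpha * grad_loss(theta)
    return grad_Phi_conj(dual)

# With Phi(theta) = 0.5 * ||theta||^2, both mirror maps are the identity,
# and the step reduces to plain (Euclidean) gradient descent.
step = mirror_descent_step(np.array([1.0, -2.0]),
                           grad_loss=lambda t: 2.0 * t,
                           grad_Phi=lambda t: t,
                           grad_Phi_conj=lambda d: d,
                           alpha=0.1)
print(step)   # [0.8, -1.6]
```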

Natural-gradient Descent as Mirror Descent

To show that natural-gradient descent is a mirror descent update, we have to make the following assumption.

Additional assumption:

Natural parameter space $\Omega_\eta$ is unconstrained ($\Omega_\eta=\mathcal{R}^K$), where $K$ is the number of entries of parameter array $\eta$.

The following example illustrates that the expectation space $\Omega_m$ is constrained even when $\Omega_\eta$ is unconstrained.

Example: $\Omega_m$ is constrained while $\Omega_\eta$ is unconstrained (click to expand)

Example: Bernoulli family

We consider this family as discussed in the previous section $ \{ \mathcal{I}(w=0) \pi + \mathcal{I}(w=1) (1-\pi) \Big| 0<\pi<1 \}$

We re-express it in an exponential form as

$$ \begin{aligned} p({w}|\mathbf{\eta}) &=\underbrace{ \exp(0) }_{ h_\eta({w}) } \exp( \langle \underbrace{ \log \frac{\pi}{1-\pi}}_{\eta} , \underbrace{ \mathcal{I}(w=0)}_{T_\eta(w) } \rangle - \log \frac{1}{1-\pi} ) \end{aligned} $$ where $\pi = \frac{\exp(\eta)}{1+ \exp(\eta) } $ and $A_\eta(\mathbf{\eta}) = \log \frac{1}{1-\pi} = \log(1+\exp(\eta))$.

The natural parameter space $\Omega_\eta= \{ \eta | \eta \in \mathcal{R} \}=\mathcal{R}^1$.

The corresponding expectation parameter is $m = E_{p(w|\eta)}[ T_\eta (w) ] = \pi$.

The expectation parameter space $\Omega_m= \{ m| 0<m<1 \}$ is a constrained open set in $\mathcal{R}^1$.

Now, consider the following mirror descent in the expectation parameter space $\Omega_m$, where $\alpha>0$.

$$ \begin{aligned} \mathbf{m}_{k+1} \leftarrow \arg \min_{x \in \Omega_m}\{ \langle \nabla_m \ell_m(\mathbf{m}_k), \mathbf{x}-\mathbf{m}_k \rangle + \frac{1}{\alpha} \mathrm{B}_{A^*_\eta}(\mathbf{x},\mathbf{m}_k) \} \end{aligned}\tag{7}\label{7} $$ where $\mathbf{m}_{k} \in \Omega_m$, $\nabla_m \ell_m(\mathbf{m}_k):= \nabla_m \ell_\eta (\eta(\mathbf{m}_k))$ and the Bregman divergence $\mathrm{B}_{A^*_\eta}(\cdot,\cdot)$ is well-defined since $A^*_\eta$ is strictly convex in $\Omega_m$. Recall that $\Omega_m$ can still be constrained.

Claim:

When $\Omega_\eta = \mathcal{R}^K$, the solution of $\eqref{7}$ is equivalent to $\eta_{k+1} \leftarrow \eta_k - \alpha \nabla_m \ell_m(\mathbf{m}_k)$.

Proof of the claim (click to expand)

Proof :

Denote $$ \begin{aligned} u(\mathbf{x}) &:=\langle \nabla_m \ell_m(\mathbf{m}_k), \mathbf{x}-\mathbf{m}_k \rangle + \frac{1}{\alpha} [ \mathrm{B}_{A^*_\eta}(\mathbf{x},\mathbf{m}_k)] \\ & = \langle \nabla_m \ell_m(\mathbf{m}_k), \mathbf{x}-\mathbf{m}_k \rangle + \frac{1}{\alpha} [A^*_\eta( \mathbf{x}) - A^*_\eta( \mathbf{m}_k) - \langle \mathbf{x}-\mathbf{m}_k, \underbrace{ \nabla_{\mathbf{m}} A^*_\eta( \mathbf{m}_k)}_{ = \eta_k \text{ by } \eqref{6} } \rangle ], \end{aligned} $$ where $\mathbf{m}_k \in \Omega_m $.

A stationary point of $\eqref{7}$, $\hat{\mathbf{x}}$, must satisfy the following condition. $$ \begin{aligned} \mathbf{0} &= \nabla_x u(\hat{\mathbf{x}}) & (\mathbf{m}_k \text{ is considered to be a constant}) \\ &= \nabla_m \ell_m(\mathbf{m}_k)+ \frac{1}{\alpha} [ \nabla_x A^*_\eta( \hat{\mathbf{x}}) - \eta_k ], \end{aligned} $$ which implies that $\nabla_x A^*_\eta( \hat{\mathbf{x}}) = \eta_k - \alpha \nabla_m \ell_m(\mathbf{m}_k)$

We first show that there exists a stationary point in the domain $(\hat{\mathbf{x}} \in \Omega_m)$. Let’s denote $\mathbf{\eta}_{k+1}:= \eta_k - \alpha \nabla_m \ell_m(\mathbf{m}_k)$. Since $\Omega_\eta$ is unconstrained, it is obvious that $\mathbf{\eta}_{k+1} \in \Omega_\eta$. By $\eqref{2}$, $\mathbf{m}(\mathbf{\eta}_{k+1}) =\nabla_\eta A_\eta( \mathbf{\eta}_{k+1}) \in \Omega_m$. Notice that, by $\eqref{6}$, we have $\nabla_m A^*_\eta ( \mathbf{m}(\mathbf{\eta}_{k+1}) ) = \mathbf{\eta}_{k+1}$, which implies that $\mathbf{m}(\mathbf{\eta}_{k+1})$ is a stationary point and $\mathbf{m}(\mathbf{\eta}_{k+1}) \in \Omega_m$.

Moreover, $\mathbf{m}(\mathbf{\eta}_{k+1})$ is the unique solution of $\eqref{7}$ since $\nabla_x^2 u(\mathbf{x}) =\frac{1}{\alpha}\nabla_x^2 A^*_\eta( {\mathbf{x}})$ is positive-definite for any $\mathbf{x} \in \Omega_m$ and therefore $u(\mathbf{x})$ is strictly convex. In other words, $\mathbf{m}_{k+1} = \mathbf{m}( {\mathbf{\eta}_{k+1}}) $.

In summary, when $\Omega_\eta = \mathcal{R}^K$, the unique solution of $\eqref{7}$ is $$ \begin{aligned} \eta_{k+1} & \leftarrow \eta_k - \alpha \nabla_m \ell_m(\mathbf{m}_k) \\ \mathbf{m}_{k+1} & \leftarrow \nabla_\eta A_\eta( {\mathbf{\eta}_{k+1}}) \end{aligned} $$

By the claim, mirror descent of $\eqref{7}$ in expectation parameter space $\Omega_m$ is equivalent to the following update $$ \begin{aligned} \eta_{k+1} \leftarrow \eta_k - \alpha \nabla_m \ell_m(\mathbf{m}_k) = \eta_k - \alpha\nabla_m \ell_\eta( \underbrace{ \eta(\mathbf{m}_k) }_{= \eta_k}), \end{aligned}\tag{8}\label{8} $$ which is exactly natural gradient descent in natural parameter space $\Omega_\eta=\mathcal{R}^K$ since by $\eqref{4}$, we have $\nabla_m \ell_m(\mathbf{m}_k) = \nabla_m \ell_\eta( \eta_k)= \mathbf{F} _\eta^{-1} (\eta_k) \nabla_\eta \ell_\eta(\eta_k)$.
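The claim can be illustrated numerically with the Bernoulli family from before: one natural-gradient step $\eqref{8}$ in $\eta$ and one mirror-descent step $\eqref{7}$ in $\mathbf{m}$ (solved here by a generic bounded optimizer) land on the same point. This is only a sketch; the objective below is an arbitrary illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

sigmoid = lambda eta: 1.0 / (1.0 + np.exp(-eta))                # m(eta) = grad A(eta)
A_star  = lambda m: m * np.log(m) + (1 - m) * np.log(1 - m)     # conjugate of A on (0, 1)
dA_star = lambda m: np.log(m / (1 - m))                         # grad A*(m) = eta

loss_grad_m = lambda m: 2.0 * (m - 0.9)                         # gradient of (m - 0.9)^2

alpha, eta_k = 0.5, -0.3
m_k = sigmoid(eta_k)

# Natural-gradient step in eta, eq. (8)
eta_next = eta_k - alpha * loss_grad_m(m_k)

# Mirror-descent step in m, eq. (7), solved numerically on the constrained set (0, 1)
def md_objective(x):
    bregman = A_star(x) - A_star(m_k) - dA_star(m_k) * (x - m_k)
    return loss_grad_m(m_k) * (x - m_k) + bregman / alpha

m_next = minimize_scalar(md_objective, bounds=(1e-8, 1 - 1e-8), method="bounded").x
print(m_next, sigmoid(eta_next))   # the two values should (approximately) agree
```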


References

[1] S. Johansen, Introduction to the theory of regular exponential families (Institute of Mathematical Statistics, University of Copenhagen, 1979).

[2] M. J. Wainwright & M. I. Jordan, Graphical models, exponential families, and variational inference (Now Publishers Inc, 2008).

[3] M. Khan & W. Lin, "Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models," Artificial Intelligence and Statistics (PMLR, 2017), pp. 878–887.

Footnotes:

  1. When the natural parameter $\eta$ is minimal, this Legendre transformation is diffeomorphic since $\nabla_\eta^2 A_\eta(\eta)$ is positive-definite in its domain. In other words, the expectation parameter $\mathbf{m}$ is also intrinsic when the natural parameter $\eta$ is minimal.