Part III: Invariance of Natural-Gradients

13 minute read

Goal (edited: 01-Jan-23)

This blog post focuses on invariant properties of natural-gradients. We will discuss

  • transformation rules of natural-gradients,
  • the automatic computation of re-parametrized natural-gradients via the transformation rules,
  • non-invariance of the Euclidean counterpart.

The discussion here is informal and focuses more on intuitions, rather than rigor.

How to cite this blog post:
@misc{lin2021NGDblog03,
  title = {Introduction to Natural-gradient Descent: Part III},
  author = {Lin, Wu and Nielsen, Frank and Khan, Mohammad Emtiyaz and Schmidt, Mark},
  url = {https://informationgeometryml.github.io/year-archive/}, 
  howpublished = {\url{https://informationgeometryml.github.io/posts/2021/11/Geomopt03/}},
  year = {2021},
  note = {Accessed: 2021-11-02}
}

Parameter Transformation and Invariance


Recall that a Riemannian gradient is also known as a natural gradient since we use the Fisher-Rao metric $\mathbf{F}$ as the Riemannian metric.

In Part II, we have shown that a Riemannian gradient is a parametric representation of the tangent direction of a curve in a manifold. Since a curve and its tangent direction are geometric objects, they should be invariant to the choice of parametrization. In other words, geometric properties should be preserved in any valid coordinate system. This is a coordinate-free argument.

This argument may seem abstract for beginners. To be more concrete, we consider the corresponding coordinate-dependent argument: geometric properties should remain unchanged if we perform a (valid) parameter transformation. This coordinate-dependent argument also gives us a rule to compute natural-gradients under a parameter transformation.

The transformation rules are summarized in the following table.

| Type of gradient | Transformation rule |
|---|---|
| Euclidean gradient (a.k.a. differential 1-form) | $\eqref{6}$ |
| Euclidean steepest descent direction (normalized Euclidean gradient) | Non-invariant |
| Riemannian gradient | $\eqref{7}$ |
| Riemannian steepest descent direction (normalized Riemannian gradient) | $\eqref{10}$ |

The following example illustrates the transformation rules shown in the table.

Univariate Gaussian example:

Consider the following scalar function $$ \begin{aligned} h_\tau(\tau):= E_{q(w|\tau)} [ w^2 + \log q(w|\tau) ] = \mu^2 + \frac{1}{s} + \frac{1}{2} \log(s)- \frac{1}{2}(1+\log(2\pi)) \end{aligned} $$ where $q(w|\tau)= \mathcal{N}(w|\mu,s^{-1})$ is a Gaussian family with mean $\mu$, variance $s^{-1}$, intrinsic parametrization $\tau=(\mu,s)$, and parameter space $\Omega_\tau=\{(\mu,s)|\mu \in \mathcal{R},s>0 \}$.

The FIM of Gaussian $q(w|\tau)$ under parametrization $\tau$ is $$ \begin{aligned} \mathbf{F}_\tau (\tau) := -E_{q(w|\tau)} [ \nabla_\tau^2 \log q(w|\tau) ] = \begin{bmatrix} s & 0 \\ 0 & \frac{1}{2s^2} \end{bmatrix} \end{aligned} $$ We consider a member $\tau_0=(0.5,0.5)$ in the Gaussian family. The Euclidean gradient is $$ \begin{aligned} \mathbf{g}_\tau (\tau_0) := \nabla_\tau h_\tau(\tau_0) = \begin{bmatrix} 2 \mu \\ -\frac{1}{s^2} +\frac{1}{2s} \end{bmatrix}_{\tau=\tau_0} =\begin{bmatrix} 1 \\ -3 \end{bmatrix} \end{aligned} $$ The natural/Riemannian gradient is $$ \begin{aligned} \hat{\mathbf{g}}_\tau (\tau_0) := \mathbf{F}_\tau^{-1} (\tau_0) \nabla_\tau h_\tau(\tau_0) = \begin{bmatrix} 2 \mu s^{-1} \\ ( -\frac{1}{s^2} +\frac{1}{2s} ) (2s^2) \end{bmatrix}_{\tau=\tau_0} =\begin{bmatrix} 2 \\ -\frac{3}{2} \end{bmatrix} \end{aligned} $$
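As a quick numerical sanity check, here is a minimal sketch (assuming JAX is available; `h_tau` and `fim_tau` are our own helper names) that reproduces these numbers via automatic differentiation:

```python
# Minimal sketch reproducing the gradients above under tau = (mu, s).
# Assumes JAX; h_tau and fim_tau are our own helper names.
import jax
import jax.numpy as jnp

def h_tau(tau):
    # Closed-form h_tau from the text.
    mu, s = tau
    return mu**2 + 1.0 / s + 0.5 * jnp.log(s) - 0.5 * (1.0 + jnp.log(2.0 * jnp.pi))

def fim_tau(tau):
    # Closed-form FIM of N(w | mu, s^{-1}) under tau = (mu, s).
    _, s = tau
    return jnp.diag(jnp.array([s, 0.5 / s**2]))

tau0 = jnp.array([0.5, 0.5])
g_tau = jax.grad(h_tau)(tau0)                        # Euclidean gradient: [1., -3.]
nat_g_tau = jnp.linalg.solve(fim_tau(tau0), g_tau)   # natural gradient:  [2., -1.5]
print(g_tau, nat_g_tau)
```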

Now, consider the following re-parametrization of the function

$$ \begin{aligned} h_\lambda(\lambda)= E_{q(w|\lambda)} [ w^2 + \log q(w|\lambda)] = \mu^2 + v - \frac{1}{2} \log(v) - \frac{1}{2}(1+\log(2\pi)) \end{aligned} $$ where $q(w|\lambda)= \mathcal{N}(w|\mu,v)$ with mean $\mu$ and variance $v=s^{-1}$, intrinsic parametrization $\lambda=(\mu,v)$, and parameter space $\Omega_\lambda=\{(\mu,v)|\mu \in \mathcal{R},v>0 \}$.

The FIM of Gaussian $q(w|\lambda)$ under parametrization $\lambda$ is $$ \begin{aligned} \mathbf{F}_\lambda (\lambda):= -E_{q(w|\lambda)} [ \nabla_\lambda^2 \log q(w|\lambda) ] = \begin{bmatrix} \frac{1}{v} & 0 \\ 0 & \frac{1}{2v^2} \end{bmatrix} \end{aligned} $$

The Jacobian matrix is $$ \begin{aligned} \mathbf{J} :=\frac{\partial \lambda(\tau)}{\partial \tau} = \begin{bmatrix} 1 & 0 \\ 0 & -\frac{1}{s^2} \end{bmatrix} \end{aligned} $$ where $\lambda(\tau)=(\mu,v)=(\mu,\frac{1}{s})$ and $\tau=(\mu,s)$.

We can verify that $\eqref{8}$ holds for the FIM.

Consider the same member $\lambda_0=(0.5,2)$ in the Gaussian family. The Euclidean gradient is $$ \begin{aligned} \mathbf{g}_\lambda (\lambda_0) := \nabla_\lambda h_\lambda(\lambda_0) = \begin{bmatrix} 2 \mu \\ 1 - \frac{1}{2v} \end{bmatrix}_{\lambda=\lambda_0} =\begin{bmatrix} 1 \\ \frac{3}{4} \end{bmatrix} \end{aligned} $$ We can verify that $\eqref{6}$ holds for the Euclidean gradient.

The natural/Riemannian gradient is $$ \begin{aligned} \hat{\mathbf{g}}_\lambda (\lambda_0) := \mathbf{F}_\lambda^{-1} (\lambda_0) \nabla_\lambda h_\lambda(\lambda_0) = \begin{bmatrix} 2 \mu v \\ ( 1 - \frac{1}{2v} ) (2v^2) \end{bmatrix}_{\lambda=\lambda_0} =\begin{bmatrix} 2 \\ 6 \end{bmatrix} \end{aligned} $$

We can verify that $\eqref{7}$ holds for the Riemannian gradient.
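The rules $\eqref{6}$, $\eqref{7}$, and $\eqref{8}$ can also be checked numerically in one go. Below is a sketch under the same assumptions as before (JAX; `lam_of_tau` is our name for the map $\tau \mapsto \lambda$), using the gradients and FIM computed above:

```python
# Sketch verifying transformation rules (6), (7), and (8) for the Gaussian example.
import jax
import jax.numpy as jnp

lam_of_tau = lambda tau: jnp.array([tau[0], 1.0 / tau[1]])  # (mu, s) -> (mu, v)

tau0 = jnp.array([0.5, 0.5])
J = jax.jacobian(lam_of_tau)(tau0)           # diag(1, -1/s^2) = diag(1, -4)

g_lam = jnp.array([1.0, 0.75])               # Euclidean gradient under lambda
nat_g_lam = jnp.array([2.0, 6.0])            # natural gradient under lambda
F_lam = jnp.diag(jnp.array([0.5, 0.125]))    # F_lambda(lambda_0) with v = 2

print(J.T @ g_lam)                           # rule (6): recovers g_tau = [1., -3.]
print(jnp.linalg.solve(J, nat_g_lam))        # rule (7): recovers [2., -1.5]
print(J.T @ F_lam @ J)                       # rule (8): recovers diag(0.5, 2)
```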

We will show that two key geometric properties remain the same under any intrinsic parameter transformation.

  1. Directional derivative
  2. Length of a Riemannian vector/gradient induced by the Fisher-Rao metric

Thanks to these properties, we will show that the optimal solution of the Riemannian steepest direction considered in Part II is invariant under an intrinsic parameter transformation. This is in contrast with the Euclidean steepest direction, which is not invariant under an intrinsic parameter transformation.

In Part II, we considered a point $\mathbf{x}_0$ in a manifold $\mathcal{M}$, a (1-dimensional) curve $\gamma(t)$, and a smooth scalar function $h: \mathcal{M} \to \mathcal{R}$. Given an intrinsic parametrization $\tau$ defined around the point, we consider the following parametric representations.

| Geometric object | Parametric representation |
|---|---|
| point $\mathbf{x}_0$ | $\tau_0$ |
| curve $\gamma(t)$ | $\gamma_\tau(t)$ |
| function value $h(\mathbf{x}_0)$ | $h_\tau(\tau_0)$ |

Transformation Rules for Riemannian Gradients and Euclidean Gradients

Intuitively, the following identity should hold for any two intrinsic parametrizations $\tau$ and $\lambda$ that represent a common subset of points in a manifold. $$ \begin{aligned} h(\gamma(t)) = \underbrace{[h \circ \Psi_\tau^{-1}]}_{h_\tau}( \underbrace{ \Psi_\tau( \gamma(t) ) }_{:= \gamma_\tau(t) } ) = \underbrace{[h \circ \Psi_\lambda^{-1}]}_{h_\lambda}( \underbrace{ \Psi_\lambda( \gamma(t) ) }_{:= \gamma_\lambda(t) } ) \end{aligned} $$ where we consider $t$ to be fixed and $\Psi_\tau$ is known as the coordinate/parametrization map for parametrization $\tau$ (i.e., $\Psi_\tau: \mathcal{M} \to \mathcal{R}^{\mathrm{dim}(\mathcal{M})}$).

Technically speaking, domain $\mathbf{I}_\tau$ of curve $\gamma_\tau(t)$ and domain $\mathbf{I}_\lambda$ of curve $\gamma_\lambda(t)$ may be different. For simplicity, we assume both domains are open intervals containing 0. In other words, $\gamma_\tau(0)=\tau_0$ and $\gamma_\lambda(0)=\lambda_0$ are parametric representations of the same point $\mathbf{x}_0$.

From the above expression, we can see that directional derivatives should be the same at $t=0$ $$ \begin{aligned} \frac{d h_\tau({\gamma}_\tau(t)) }{d t} \Big|_{t=0} = \frac{d h_\lambda({\gamma}_\lambda(t)) }{d t} \Big|_{t=0} \end{aligned}\tag{1}\label{1} $$

In Part II, we have shown that $$ \begin{aligned} \frac{d h_\tau({\gamma}_\tau(t)) }{d t} \Big|_{t=0} &= [\nabla h_\tau(\mathbf{\tau}_0) ]^T \frac{d {\gamma}_\tau(t) }{d t} \Big|_{t=0} \\ \frac{d h_\lambda({\gamma}_\lambda(t)) }{d t} \Big|_{t=0} & = [\nabla h_\lambda(\mathbf{\lambda}_0) ]^T \frac{d {\gamma}_\lambda(t) }{d t} \Big|_{t=0} \end{aligned} $$ where $\nabla$ is the standard (coordinate) derivative.

Recall that in Part II, we have shown that $\frac{d {\gamma}_\tau(t) }{d t} \Big|_{t=0}$ is a parametric representation of a tangent vector, which is a Riemannian gradient. Notice that $\nabla h_\lambda(\mathbf{\lambda}_0)$ is a Euclidean gradient (see footnote 1).

We will use the following notations to simplify expressions.

| Notation | Meaning |
|---|---|
| $(g_\tau)_i$ | $i$-th entry of the Euclidean gradient under parametrization $\tau$ |
| $(\hat{g}_\tau)^j$ | $j$-th entry of the Riemannian gradient under parametrization $\tau$ |
| $\tau^j$ | $j$-th parameter under parametrization $\tau$ |

Using this notation, the directional derivatives can then be re-expressed as

$$ \begin{aligned} \frac{d h_\tau({\gamma}_\tau(t)) }{d t} \Big|_{t=0} &= \sum_{i} (g_\tau)_i (\hat{g}_\tau)^i = \mathbf{g}_\tau^T \hat{\mathbf{g}}_\tau \\ \frac{d h_\lambda({\gamma}_\lambda(t)) }{d t} \Big|_{t=0} & =\sum_{i} (g_\lambda)_i (\hat{g}_\lambda)^i = {\mathbf{g}}_\lambda^T \hat{\mathbf{g}}_\lambda \end{aligned}\tag{2}\label{2} $$ where $\mathbf{g}_\lambda$ and $\mathbf{g}_\tau$ are Euclidean gradients (e.g., $\mathbf{g}_\tau=\nabla h_\tau(\tau_0) $) while $\hat{\mathbf{g}}_\lambda$ and $\hat{\mathbf{g}}_\tau$ are Riemannian gradients (e.g., $\hat{\mathbf{g}}_\tau=\frac{d \gamma_\tau(t) }{d t} \Big|_{t=0}$).

By $\eqref{1}$ and $\eqref{2}$, we have the following identity obtained from the geometric property of directional derivatives. $$ \begin{aligned} \mathbf{g}_\tau^T \hat{\mathbf{g}}_\tau = \mathbf{g}_\lambda^T \hat{\mathbf{g}}_\lambda \end{aligned}\tag{3}\label{3} $$
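For the univariate Gaussian example above, both sides of $\eqref{3}$ indeed agree:

$$ \begin{aligned} \mathbf{g}_\tau^T \hat{\mathbf{g}}_\tau = (1)(2) + (-3)\big(-\tfrac{3}{2}\big) = \tfrac{13}{2}, \qquad \mathbf{g}_\lambda^T \hat{\mathbf{g}}_\lambda = (1)(2) + \big(\tfrac{3}{4}\big)(6) = \tfrac{13}{2} \end{aligned} $$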

Now, we discuss the parameter transformation between $\tau$ and $\lambda$, where we can express $\lambda$ in terms of $\tau$ denoted by $\lambda(\tau)$.

By the (standard) chain rule for a Euclidean gradient (see footnote 2), we have $$ \begin{aligned} (g_\tau)_i = \sum_{k} {\color{red} (g_\lambda)_k} \frac{ {\color{red} \partial \lambda^k(\tau) }}{ \partial \tau^i } \end{aligned} \tag{4}\label{4} $$

Let $J_{ki}:=\frac{\partial \lambda^k(\tau) }{ \partial \tau^i }$ denote the $(k,i)$ entry of the Jacobian matrix. We illustrate this matrix notation in the 2D case below.

$$ \begin{aligned} \mathbf{J} = \begin{bmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{bmatrix} \end{aligned}\tag{5}\label{5} $$ where the row index $k$ runs over the entries of $\lambda$ and the column index $i$ runs over the entries of $\tau$.

Eq. $\eqref{4}$ gives us the transformation rule for Euclidean gradients (denoted by a row vector) in vector form as below.

$$ \begin{aligned} \mathbf{g}_\tau^T = \mathbf{g}_\lambda^T \mathbf{J} \end{aligned}\tag{6}\label{6} $$

Note: row vector ${\mathbf{g}}_\tau^T$ can be computed via a vector-Jacobian product (see footnote 3) in any standard Auto-Diff toolbox, given that ${\mathbf{g}}_\lambda$ is pre-computed.
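As a concrete sketch (assuming JAX; `lam_of_tau` is our name for the map $\tau \mapsto \lambda$ in the Gaussian example), rule $\eqref{6}$ is one call to `jax.vjp`:

```python
# Sketch: compute g_tau from a pre-computed g_lam via a vector-Jacobian product,
# without ever forming the Jacobian J explicitly. Assumes JAX.
import jax
import jax.numpy as jnp

lam_of_tau = lambda tau: jnp.array([tau[0], 1.0 / tau[1]])  # (mu, s) -> (mu, v)

tau0 = jnp.array([0.5, 0.5])
g_lam = jnp.array([1.0, 0.75])     # pre-computed Euclidean gradient under lambda

_, vjp_fn = jax.vjp(lam_of_tau, tau0)
(g_tau,) = vjp_fn(g_lam)           # implements g_tau^T = g_lam^T J, i.e. rule (6)
print(g_tau)                       # [1., -3.]
```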

Since $\eqref{3}$ holds for any smooth function $h$, and hence for every Euclidean gradient $\mathbf{g}_\lambda$, substituting $\eqref{6}$ into $\eqref{3}$ gives $\mathbf{g}_\lambda^T \mathbf{J} \hat{\mathbf{g}}_\tau = \mathbf{g}_\lambda^T \hat{\mathbf{g}}_\lambda$. We thus obtain the transformation rule for Riemannian gradients (denoted by a column vector) as below, where $\mathbf{Q}:=\mathbf{J}^{-1}$ is also a Jacobian matrix and $Q_{ki}=\frac{\partial \tau^k(\lambda)}{\partial \lambda^i}$ is the $(k,i)$ entry of matrix $\mathbf{Q}$.

$$ \begin{aligned} \hat{\mathbf{g}}_\tau= \mathbf{Q} \hat{\mathbf{g}}_\lambda \end{aligned}\tag{7}\label{7} $$

Note: column vector $\hat{\mathbf{g}}_\tau$ can be computed via an (inverse) Jacobian-vector product (see footnote 4) used in forward-mode differentiation [1] [2], given that $\hat{\mathbf{g}}_\lambda$ is pre-computed and we can explicitly express $\tau$ in terms of $\lambda$. As shown in [2], this Jacobian-vector product can also be computed by using two vector-Jacobian products in any standard Auto-Diff toolbox.
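A corresponding sketch for rule $\eqref{7}$ (assuming JAX; `tau_of_lam` is our name for the inverse map $\lambda \mapsto \tau$) uses forward-mode `jax.jvp`:

```python
# Sketch: compute the natural gradient under tau from a pre-computed natural
# gradient under lambda via a Jacobian-vector product. Assumes JAX.
import jax
import jax.numpy as jnp

tau_of_lam = lambda lam: jnp.array([lam[0], 1.0 / lam[1]])  # (mu, v) -> (mu, s)

lam0 = jnp.array([0.5, 2.0])
nat_g_lam = jnp.array([2.0, 6.0])  # pre-computed natural gradient under lambda

# Forward mode pushes nat_g_lam through Q = J^{-1} = d tau / d lambda, rule (7).
_, nat_g_tau = jax.jvp(tau_of_lam, (lam0,), (nat_g_lam,))
print(nat_g_tau)                   # [2., -1.5]
```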

The elementwise expression of the transformation rule for Riemannian gradients is $$ \begin{aligned} (\hat{g}_\tau)^k = \sum_{i} \frac{ \partial \tau^k(\lambda) }{ {\color{red} \partial \lambda^i} } {\color{red} (\hat{g}_\lambda)^i} \end{aligned} $$

Note that these transformation rules are valid when the Jacobian matrix $\mathbf{J}$ is square and non-singular. As we discussed in Part I about intrinsic parametrizations, the transformation map between $\tau$ and $\lambda$ must be bijective, which implies the Jacobian matrix is square. Moreover, the map and its inverse map should be smooth, which implies that the Jacobian matrix is well-defined and non-singular.

Transformation Rule for the Fisher Information Matrix

Now, we discuss the transformation rule for the Fisher information matrix (FIM) as defined in Part I.

$$ \begin{aligned} F_{ij}(\tau) := E_{p(w|\tau) } [ \Big( \partial_{\tau_i} \log p(w|\tau ) \Big) \Big(\partial_{\tau_j} \log p(w|\tau) \Big) ] \end{aligned} $$ Since $ \log p(w|\tau )$ can be considered as a scalar function $h$ defined on the manifold for any valid $w$, we have $$ \begin{aligned} \log p(w|\tau_0 ) = h_\tau(\tau_0) = h_\lambda(\lambda_0) = \log p(w|\lambda_0 ) \end{aligned} $$

Thus, the FIM can be computed as $$ \begin{aligned} F_{ij}(\tau_0) &= E_{p(w|\tau_0) } [ \Big( \partial_{\tau_i} \log p(w|\tau_0 ) \Big) \Big(\partial_{\tau_j} \log p(w|\tau_0) \Big) ] \\ &= E_{p(w|\lambda_0) } [ \Big( \partial_{\tau_i} \log p(w|\tau_0 ) \Big) \Big(\partial_{\tau_j} \log p(w|\tau_0) \Big) ]\\ \end{aligned} $$

Recall that by the standard chain rule, we have $$ \begin{aligned} \partial_{\tau_i} \log p(w|\tau_0 ) = \sum_k \frac{ \partial \lambda^k(\tau_0) }{ \partial \tau^i } \Big( \partial_{\lambda_k} \log p(w|\lambda_0 ) \Big) \end{aligned} $$

Moreover, the Jacobian matrix does not depend on $w$. Therefore, we have $$ \begin{aligned} F_{ij}(\tau_0) &= E_{p(w|\lambda_0) } [ \Big( \partial_{\tau_i} \log p(w|\tau_0 ) \Big) \Big(\partial_{\tau_j} \log p (w|\tau_0) \Big) ]\\ &= E_{p(w|\lambda_0) } [ \Big( \sum_k \frac{ \partial \lambda^k(\tau_0) }{ \partial \tau^i } \partial_{\lambda_k} \log p(w|\lambda_0 ) \Big) \Big( \sum_l \frac{ \partial \lambda^l(\tau_0) }{ \partial \tau^j } \partial_{\lambda_l} \log p(w|\lambda_0 ) \Big) ] \\ &= \sum_k \sum_l \frac{ \partial \lambda^k(\tau_0) }{ \partial \tau^i } \frac{ \partial \lambda^l(\tau_0) }{ \partial \tau^j } E_{p(w|\lambda_0) } [ \Big( \partial_{\lambda_k} \log p(w|\lambda_0 ) \Big) \Big( \partial_{\lambda_l} \log p(w|\lambda_0 ) \Big) ] \\ &= \sum_k \sum_l \frac{ \partial \lambda^k(\tau_0) }{ \partial \tau^i } \frac{ \partial \lambda^l(\tau_0) }{ \partial \tau^j } F_{kl}(\lambda_0) \end{aligned} $$

We can re-express the above expression in a matrix form as below. This is the transformation rule for the FIM.

$$ \begin{aligned} \mathbf{F}_{\tau} (\tau_0) = \mathbf{J}^T \mathbf{F}_{\lambda} (\lambda_0) \mathbf{J} \end{aligned}\tag{8}\label{8} $$ where, as before, $J_{ki} = \frac{ \partial \lambda^k(\tau_0) }{ \partial \tau^i }$ is the $(k,i)$ entry of the Jacobian matrix $\mathbf{J}$.
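In code, rule $\eqref{8}$ amounts to sandwiching the FIM between the Jacobian and its transpose. A small sketch (assuming JAX; `fim_under_tau` is our own helper) for the Gaussian example:

```python
# Sketch: pull a FIM given under lambda back to the tau parametrization
# via rule (8): F_tau = J^T F_lambda J. Assumes JAX; helper names are ours.
import jax
import jax.numpy as jnp

def fim_under_tau(fim_lam, lam_of_tau, tau0):
    J = jax.jacobian(lam_of_tau)(tau0)
    return J.T @ fim_lam(lam_of_tau(tau0)) @ J

# Gaussian example: F_lambda(mu, v) = diag(1/v, 1/(2 v^2)).
fim_lam = lambda lam: jnp.diag(jnp.array([1.0 / lam[1], 0.5 / lam[1] ** 2]))
lam_of_tau = lambda tau: jnp.array([tau[0], 1.0 / tau[1]])

print(fim_under_tau(fim_lam, lam_of_tau, jnp.array([0.5, 0.5])))  # diag(0.5, 2)
```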

By using this transformation rule, we can show another geometric property: the length of a Riemannian vector induced by the Fisher-Rao metric is preserved.

We can see that the length of a Riemannian vector is also invariant. $$ \begin{aligned} \| \hat{\mathbf{g}}_\tau \|^2_{F_{\tau_0}} &= [\hat{\mathbf{g}}_\tau]^T \mathbf{F}_{\tau} (\tau_0) \hat{\mathbf{g}}_\tau \\ &= [\mathbf{J}^{-1} \hat{\mathbf{g}}_\lambda]^T \mathbf{F}_{\tau} (\tau_0) \mathbf{J}^{-1} \hat{\mathbf{g}}_\lambda \\ &= [\hat{\mathbf{g}}_\lambda]^T [ \mathbf{J}^{-T} \mathbf{F}_{\tau} (\tau_0) \mathbf{J}^{-1} ] \hat{\mathbf{g}}_\lambda \\ &= [\hat{\mathbf{g}}_\lambda]^T \mathbf{F}_{\lambda} (\lambda_0) \hat{\mathbf{g}}_\lambda = \| \hat{\mathbf{g}}_\lambda \|^2_{F_{\lambda_0}} \end{aligned}\tag{9}\label{9} $$
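In the univariate Gaussian example, with $\mathbf{F}_\tau(\tau_0) = \mathrm{diag}(\tfrac{1}{2}, 2)$ and $\mathbf{F}_\lambda(\lambda_0) = \mathrm{diag}(\tfrac{1}{2}, \tfrac{1}{8})$, both lengths indeed evaluate to the same number:

$$ \begin{aligned} \| \hat{\mathbf{g}}_\tau \|^2_{F_{\tau_0}} = \tfrac{1}{2}(2)^2 + 2\big(-\tfrac{3}{2}\big)^2 = \tfrac{13}{2}, \qquad \| \hat{\mathbf{g}}_\lambda \|^2_{F_{\lambda_0}} = \tfrac{1}{2}(2)^2 + \tfrac{1}{8}(6)^2 = \tfrac{13}{2} \end{aligned} $$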

Riemannian Steepest Direction is Invariant


Now, we can show that the optimal solutions of the Riemannian steepest direction considered in Part II under parametrizations $\tau$ and $\lambda$ are equivalent, since both the length and the directional derivative remain the same.

Denote Euclidean gradients as $\mathbf{g}_\lambda:= \nabla f_\lambda(\mathbf{\lambda}_0) $ and $\mathbf{g}_\tau:= \nabla f_\tau(\mathbf{\tau}_0) = \nabla_\tau f_\lambda(\mathbf{\lambda}(\tau_0)) $, which follow the parameter transformation rule in $\eqref{6}$.

Now, consider natural/Riemannian gradients as $\hat{\mathbf{g}}_\lambda:= \mathbf{F}_{\lambda}^{-1}(\mathbf{\lambda}_0) \mathbf{g}_\lambda $ and $\hat{\mathbf{g}}_\tau:= \mathbf{F}_{\tau}^{-1}(\mathbf{\tau}_0) \mathbf{g}_\tau $. These Riemannian gradients follow the parameter transformation rule in $\eqref{7}$ as shown below.

$$ \begin{aligned} \hat{\mathbf{g}}_\tau &= \mathbf{F}_{\tau}^{-1}(\mathbf{\tau}_0) \mathbf{g}_\tau \\ &= \big( \mathbf{J}^T \mathbf{F}_{\lambda} (\mathbf{\lambda}_0) \mathbf{J} \big)^{-1} ( \mathbf{g}^T_\tau )^T & ( \text{by } \eqref{8} )\\ &= \mathbf{J}^{-1} \mathbf{F}_{\lambda}^{-1} (\mathbf{\lambda}_0) \mathbf{J}^{-T} ( \mathbf{g}^T_\lambda \mathbf{J} )^T & ( \text{by } \eqref{6} ) \\ &= \mathbf{J}^{-1} \mathbf{F}_{\lambda}^{-1} (\mathbf{\lambda}_0) \mathbf{J}^{-T} ( \mathbf{J}^T \mathbf{g}_\lambda ) \\ &= \mathbf{J}^{-1} \mathbf{F}_{\lambda}^{-1} (\mathbf{\lambda}_0) \mathbf{g}_\lambda \\ &= \mathbf{J}^{-1} \hat{\mathbf{g}}_\lambda \end{aligned} $$

Recall that the optimal solution of the Riemannian steepest direction is $$ \begin{aligned} \mathbf{v}_{\lambda}^{(opt)}= -\frac{ \mathbf{F_\lambda}^{-1}(\mathbf{\lambda}_0) \nabla_\lambda f(\mathbf{\lambda}_0) }{\| \mathbf{F_\lambda}^{-1}(\mathbf{\lambda}_0)\nabla_\lambda f(\mathbf{\lambda}_0) \|_{F_{\lambda_0}}} = -\frac{\hat{\mathbf{g}}_\lambda}{\|\hat{\mathbf{g}}_\lambda\|_{ F_{\lambda_0} } } \\ \mathbf{v}_{\tau}^{(opt)}= -\frac{ \mathbf{F_\tau}^{-1}(\mathbf{\tau}_0) \nabla_\tau f(\mathbf{\tau}_0) }{\| \mathbf{F_\tau}^{-1}(\mathbf{\tau}_0)\nabla_\tau f(\mathbf{\tau}_0) \|_{F_{\tau_0}}} = -\frac{\hat{\mathbf{g}}_\tau}{\|\hat{\mathbf{g}}_\tau\|_{ F_{\tau_0} } } \end{aligned} $$

We can easily verify the following identities since $ \|\hat{\mathbf{g}}_\lambda\|_{ F_{\lambda_0} } = \|\hat{\mathbf{g}}_\tau\|_{ F_{\tau_0} } $ as shown in $\eqref{9}$. $$ \begin{aligned} \mathbf{g}_\tau^T \mathbf{v}^{(opt)}_{\tau} & = \mathbf{g}_\lambda^T \mathbf{v}^{(opt)}_{\lambda} & \,\,\, \text{(invariance of a directional derivative)} \\ \| \mathbf{v}^{(opt)}_{\tau} \|^2_{\color{red}{F_{\tau_0}}} & = \| \mathbf{v}^{(opt)}_{\lambda} \|^2_{\color{red}{F_{\lambda_0}} } & \,\,\, \text{(invariance of the length of the Riemannian steepest direction)} \\ \mathbf{v}^{(opt)}_{\tau} & = \mathbf{J}^{-1} \mathbf{v}^{(opt)}_{\lambda} & \,\,\, \text{(transformation rule for the Riemannian steepest direction)} \end{aligned}\tag{10}\label{10} $$ In other words, the Riemannian steepest direction (which is indeed a normalized Riemannian gradient) is also transformed according to the transformation rule for Riemannian gradients in $\eqref{7}$.
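Returning to the univariate Gaussian example, the third identity can be confirmed directly: with $\|\hat{\mathbf{g}}_\tau\|_{F_{\tau_0}} = \|\hat{\mathbf{g}}_\lambda\|_{F_{\lambda_0}} = \sqrt{13/2}$ and $\mathbf{J}^{-1} = \mathrm{diag}(1, -s^2) = \mathrm{diag}(1, -\tfrac{1}{4})$,

$$ \begin{aligned} \mathbf{J}^{-1} \mathbf{v}_{\lambda}^{(opt)} = -\frac{1}{\sqrt{13/2}} \begin{bmatrix} 1 & 0 \\ 0 & -\tfrac{1}{4} \end{bmatrix} \begin{bmatrix} 2 \\ 6 \end{bmatrix} = -\frac{1}{\sqrt{13/2}} \begin{bmatrix} 2 \\ -\tfrac{3}{2} \end{bmatrix} = \mathbf{v}_{\tau}^{(opt)} \end{aligned} $$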

As we will discuss in Part IV, this invariance property implies that natural-gradient descent is linearly invariant.

Euclidean Steepest Direction is NOT Invariant


Recall that in Part II we showed that a unit Euclidean gradient is the optimal solution of the Euclidean steepest direction.

We can show that the (standard) length of a Euclidean gradient is NOT invariant under a parameter transformation due to the Jacobian matrix (i.e., $\| \mathbf{g}_\tau \| \neq \| \mathbf{g}_\lambda \|$ in general). This is one reason why we use the weighted inner product induced by the Fisher-Rao metric to define the length of a gradient vector.

Note that we again denote the Euclidean gradients as $\mathbf{g}_\lambda:= \nabla f_\lambda(\mathbf{\lambda}_0) $ and $\mathbf{g}_\tau:= \nabla f_\tau(\mathbf{\tau}_0) = \nabla_\tau f_\lambda(\mathbf{\lambda}(\tau_0)) $.

Recall that the optimal solution of the Euclidean steepest direction is $$ \begin{aligned} \mathbf{u}_{\lambda}^{(opt)}= -\frac{\nabla_\lambda f_\lambda(\mathbf{\lambda}_0) }{\|\nabla_\lambda f_\lambda(\mathbf{\lambda}_0) \|} = -\frac{\mathbf{g}_\lambda}{\|\mathbf{g}_\lambda\|} \\ \mathbf{u}_{\tau}^{(opt)}= -\frac{\nabla_\tau f_\tau(\mathbf{\tau}_0) }{\|\nabla_\tau f_{\tau}(\mathbf{\tau}_0) \|} = -\frac{\mathbf{g}_\tau}{\|\mathbf{g}_\tau\|} \end{aligned} $$

Unfortunately, the Euclidean steepest direction does NOT obey the parameter transformation rule for Euclidean gradients in $\eqref{6}$. $$ \begin{aligned} (\mathbf{u}_{\tau}^{(opt)})^T \neq (\mathbf{u}_{\lambda}^{(opt)})^T \mathbf{J} \end{aligned} $$ Indeed, by $\eqref{6}$ we have $(\mathbf{u}_{\lambda}^{(opt)})^T \mathbf{J} = -\mathbf{g}_\tau^T / \|\mathbf{g}_\lambda\|$, which differs from $(\mathbf{u}_{\tau}^{(opt)})^T = -\mathbf{g}_\tau^T / \|\mathbf{g}_\tau\|$ whenever $\|\mathbf{g}_\tau\| \neq \|\mathbf{g}_\lambda\|$.

Moreover, the optimal value of the Euclidean steepest direction is NOT invariant under a parameter transformation as $$ \begin{aligned} \mathbf{g}_\lambda^T \mathbf{u}^{(opt)}_{\lambda} = - \|\mathbf{g}_\lambda\| \neq - \|\mathbf{g}_\tau\| = \mathbf{g}_\tau^T \mathbf{u}^{(opt)}_{\tau} \end{aligned} $$
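The univariate Gaussian example makes this concrete:

$$ \begin{aligned} \|\mathbf{g}_\lambda\| = \sqrt{1^2 + \big(\tfrac{3}{4}\big)^2} = \tfrac{5}{4} \neq \sqrt{10} = \sqrt{1^2 + (-3)^2} = \|\mathbf{g}_\tau\| \end{aligned} $$

so the two optimal values $-\tfrac{5}{4}$ and $-\sqrt{10}$ differ, even though both parametrizations describe the same Gaussian family.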

In summary, the Euclidean steepest direction (which is a normalized Euclidean gradient) is NOT transformed according to the transformation rule for a Euclidean gradient. Moreover, Euclidean gradient descent is not invariant under a parameter transformation. We will cover more about this in Part IV.


References

[1] W. Lin, F. Nielsen, M. E. Khan, & M. Schmidt, "Tractable structured natural gradient descent using local parameterizations," International Conference on Machine Learning (ICML) (2021).

[2] H. Salimbeni, S. Eleftheriadis, & J. Hensman, "Natural gradients in practice: Non-conjugate variational inference in Gaussian process models," International Conference on Artificial Intelligence and Statistics (PMLR, 2018), pp. 689–697.

Footnotes:

  1. In differential geometry, a Euclidean gradient is also known as a coordinate representation of a cotangent vector. 

  2. We assume readers are familiar with the transformation rule for Euclidean gradients. Let $\tau(t)=\gamma_\tau(t)$ and $\lambda(t)=\gamma_\lambda(t)$. In differential geometry, this transformation rule can also be shown by using the following identity: $\mathbf{J}(\tau)= \frac{ \partial [\Psi_\lambda \circ \Psi_\tau^{-1}](\tau) }{\partial \tau } = \frac{ \partial \lambda }{\partial \tau } $, where we drop the index $t$ and recall that $\gamma_\tau(t)=\Psi_\tau \circ \gamma(t)$.

  3. Consider vector function $\lambda(\tau)$ from input space $\tau$ to output space $\lambda$. The linearization of function $\lambda(\tau)$ is a linear map defined by Jacobian matrix $J$. The adjoint (transpose) of this linear map, which carries (Euclidean) gradients from the output space back to the input space, is known as the pullback map of $\lambda(\tau)$ in differential geometry. Given (Euclidean) vector $g_\lambda$ in the output space, the Euclidean transformation rule (the standard chain rule) returns (Euclidean) vector $g_\tau$ in the input space by this pullback.

  4. Recall that function $\tau(\lambda)$ is the inverse map of function $\lambda(\tau)$. Consider vector function $\tau(\lambda)$ from input space $\lambda$ to output space $\tau$. The linearization of function $\tau(\lambda)$ is a linear map defined by Jacobian matrix $Q=J^{-1}$. This linear map is known as the pushforward map of $\tau(\lambda)$ in differential geometry. Given (Riemannian) vector $\hat{g}_\lambda$ in the input space, the Riemannian transformation rule returns (Riemannian) vector $\hat{g}_\tau$ in the output space by the linearization.