Part IV: Natural Gradient Descent and its Extension—Riemannian Gradient Descent
Warning: work in progress (incomplete)
Goal (edited: 07-Jan-23)
This blog post should help readers understand natural-gradient descent and Riemannian gradient descent. We also discuss the invariance properties of natural-gradient descent, Riemannian gradient descent, and Newton's method.
We will give an informal introduction with a focus on high-level ideas.
Two kinds of Spaces
We will discuss (Riemannian) gradient spaces and parameter spaces for gradient-based updates.
As we discussed in Part II, the parameter space of a manifold is in general not a vector space, while each (Riemannian) gradient space is a vector space.
In manifold cases, we have to explicitly distinguish between the representation of a point (a parameter) and that of a vector (a Riemannian gradient). These two spaces differ in many aspects, such as their domains and their notions of distance. Mathematically speaking, a (Riemannian) gradient space is much simpler and nicer than a parameter space.
Natural-gradient Descent in an Intrinsic Parameter Space
Using an intrinsic parametrization $\tau$, natural-gradient descent performs the update
$$\tau_{k+1} \leftarrow \tau_k - t\, \mathbf{F}^{-1}(\tau_k)\, \nabla_\tau \ell(\tau_k),$$
where $t>0$ is a step-size, $\ell$ is a smooth objective, $\mathbf{F}(\tau)$ is the Fisher information matrix (FIM), and $\mathbf{F}^{-1}(\tau_k)\, \nabla_\tau \ell(\tau_k)$ is the natural gradient at $\tau_k$.
This update is known as natural-gradient descent (NGD) [1].
The update in the parameter space is valid since an intrinsic parameter space is open: for a small enough step-size $t$, the updated parameter $\tau_{k+1}$ stays in the parameter space.
Warning:
Using a small step-size could be an issue since it could greatly slow down the progress of natural-gradient descent in practice.
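To make the constraint issue concrete, here is a minimal Python sketch (my toy example, assuming a Bernoulli family with mean parameter $\mu \in (0,1)$ and the average negative log-likelihood as the objective). The FIM is $F(\mu) = \frac{1}{\mu(1-\mu)}$, and the natural-gradient step simplifies to $\mu - t(\mu - \bar{y})$: a full step ($t=1$) is fine, but an overly large step leaves the space.

```python
import numpy as np

# A minimal sketch: NGD for a Bernoulli family with mean parameter mu in (0,1).
# Objective: average negative log-likelihood of observed coin flips.
y_bar = 0.05                      # empirical mean of the data (assumed)

def grad(mu):
    # Euclidean gradient of the NLL: (mu - y_bar) / (mu * (1 - mu))
    return (mu - y_bar) / (mu * (1.0 - mu))

def fim(mu):
    # Fisher information of Bernoulli(mu)
    return 1.0 / (mu * (1.0 - mu))

def ngd_step(mu, t):
    # Natural-gradient step: mu - t * F(mu)^{-1} * grad(mu) = mu - t * (mu - y_bar)
    return mu - t * grad(mu) / fim(mu)

mu = 0.5
print(ngd_step(mu, 1.0))   # 0.05: a full step lands exactly on y_bar
print(ngd_step(mu, 2.5))   # -0.625: too large a step leaves the space (0,1)
```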
Example of NGD in a constrained space:
Natural-gradient Descent is Linearly Invariant
Recall that in Part III we showed that natural gradients are invariant under any intrinsic parameter transformation; the transformation can be non-linear.
It is natural to expect that natural-gradient descent has a similar property. Unfortunately, natural-gradient descent is only invariant under an intrinsic linear transformation. Note that Newton’s method is also linearly invariant while Euclidean gradient descent is not.
Let’s consider the following (scalar) optimization problem on a smooth manifold $\mathcal{M}$:
$$\min_{\tau \in \Omega_\tau} \ell(\tau),$$
where $\tau$ is an intrinsic parametrization with parameter space $\Omega_\tau$.
Note that $\ell$ is a smooth scalar function of the parameter.
Natural gradient descent in this parameter space $\Omega_\tau$ is
$$\tau_{k+1} \leftarrow \tau_k - t\, \mathbf{F}_\tau^{-1}(\tau_k)\, \nabla_\tau \ell(\tau_k).$$
Consider another intrinsic parametrization $\lambda = u(\tau)$, where $u$ is a smooth invertible map, and denote the Jacobian by $\mathbf{J} := \frac{\partial \lambda}{\partial \tau}\big|_{\tau = \tau_k}$.
Natural gradient descent in this parameter space $\Omega_\lambda$ is
$$\lambda_{k+1} \leftarrow \lambda_k - t\, \mathbf{F}_\lambda^{-1}(\lambda_k)\, \nabla_\lambda \ell(\lambda_k).$$
Recall that we have the transformation rule for natural gradients as
$$\mathbf{F}_\lambda^{-1}(\lambda_k)\, \nabla_\lambda \ell(\lambda_k) = \mathbf{J}\, \hat{\mathbf{g}}_{\tau_k}, \qquad \hat{\mathbf{g}}_{\tau_k} := \mathbf{F}_\tau^{-1}(\tau_k)\, \nabla_\tau \ell(\tau_k).$$
We can verify that the two updates represent the same point when $u$ is linear, say $\lambda = u(\tau) = \mathbf{U}\tau + \mathbf{b}$ so that $\mathbf{J} = \mathbf{U}$:
$$\lambda_{k+1} = u(\tau_k) - t\, \mathbf{U}\, \hat{\mathbf{g}}_{\tau_k} = u\big(\tau_k - t\, \hat{\mathbf{g}}_{\tau_k}\big) = u(\tau_{k+1}).$$
When $u$ is non-linear, the Jacobian $\mathbf{J}$ varies from point to point and $u(\tau_k) - t\, \mathbf{J}\, \hat{\mathbf{g}}_{\tau_k} \neq u(\tau_k - t\, \hat{\mathbf{g}}_{\tau_k})$ in general, so natural-gradient descent is not invariant under non-linear transformations.
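Here is a quick numerical check of the linear invariance (a sketch under my own toy setup: a Gaussian family $\mathcal{N}(m, s^2)$ with $\tau = (m, s)$, FIM $\mathrm{diag}(1/s^2, 2/s^2)$, a quadratic toy objective, and a fixed invertible linear map $\mathbf{U}$):

```python
import numpy as np

def fim_tau(tau):
    # FIM of a Gaussian N(m, s^2) in the parametrization tau = (m, s)
    m, s = tau
    return np.diag([1.0 / s**2, 2.0 / s**2])

def grad_tau(tau):
    # Euclidean gradient of a smooth toy objective ell(m, s) = (m-1)^2 + (s-2)^2
    m, s = tau
    return np.array([2.0 * (m - 1.0), 2.0 * (s - 2.0)])

U = np.array([[2.0, 1.0], [0.5, 1.0]])   # invertible, not orthogonal
U_inv = np.linalg.inv(U)

def ngd_tau(tau, t):
    return tau - t * np.linalg.solve(fim_tau(tau), grad_tau(tau))

def ngd_lam(lam, t):
    # Transform metric and gradient into lambda = U @ tau:
    # F_lam = U^{-T} F_tau U^{-1},  grad_lam = U^{-T} grad_tau
    tau = U_inv @ lam
    F_lam = U_inv.T @ fim_tau(tau) @ U_inv
    g_lam = U_inv.T @ grad_tau(tau)
    return lam - t * np.linalg.solve(F_lam, g_lam)

tau0 = np.array([0.3, 1.5])
t = 0.1
print(U @ ngd_tau(tau0, t))   # update in tau, then map to lambda
print(ngd_lam(U @ tau0, t))   # update directly in lambda: identical point
```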
Riemannian Gradient Descent and its (Non-linear) Invariance
Now we discuss a gradient-based method that is invariant to any intrinsic parameter transformation.
As mentioned before, a manifold in general does not have a vector-space structure.
We have to artificially choose an intrinsic parametrization $\tau$ to perform a gradient-based update.
We will first introduce the concept of a (one-dimensional) geodesic.
We will only consider the initial value problem (IVP) of a geodesic for simplicity.
Consider an intrinsic parametrization $\tau$. A geodesic $\gamma(t)$ starting at a point $\tau_0 = \gamma(0)$ with an initial (Riemannian) vector $\hat{\mathbf{v}} = \dot{\gamma}(0)$ is the solution of an IVP for a system of (non-linear) ordinary differential equations.
For a geodesically complete manifold, the domain of the geodesic is the whole real line, i.e., $\gamma(t)$ is defined for every $t \in \mathbb{R}$.
To avoid explicitly writing down the differential equations of a geodesic² (i.e., using Christoffel symbols or a connection 1-form in a section),
researchers summarize a geodesic by the map it induces, referred to as the manifold exponential map.
Given an intrinsic parametrization $\tau$, the exponential map at a point $\tau_0$ is defined via the geodesic $\gamma(t)$ as
$$\mathrm{Exp}_{\tau_0}(\hat{\mathbf{v}}) := \gamma(1), \qquad \gamma(0) = \tau_0, \quad \dot{\gamma}(0) = \hat{\mathbf{v}},$$
where $\hat{\mathbf{v}}$ is a Riemannian gradient/vector at $\tau_0$.
Warning:
The exponential map is a geometric object. We use a parametric representation of a geodesic curve to define the exponential map. Thus, the form/expression of the map does depend on the choice of parametrizations.
Now, we can define a Riemannian gradient method.
Under an intrinsic parametrization $\tau$, Riemannian gradient descent performs the update
$$\tau_{k+1} \leftarrow \mathrm{Exp}_{\tau_k}\big(-t\, \hat{\mathbf{g}}_{\tau_k}\big), \qquad \hat{\mathbf{g}}_{\tau_k} := \mathbf{F}^{-1}(\tau_k)\, \nabla_\tau \ell(\tau_k),$$
where $t > 0$ is a step-size and $\hat{\mathbf{g}}_{\tau_k}$ is the Riemannian/natural gradient at $\tau_k$.
The invariance of this update is due to the uniqueness of the geodesic and the invariance of the Euler-Lagrange equation. We will not discuss this further in this post to avoid complicated derivations. We will cover this in future posts about calculus of variations.
Although Riemannian gradient descent is nice, the exponential map or the geodesic often has a high computational cost and does not admit a closed-form expression.
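That said, some simple manifolds do admit a closed-form geodesic. As a sketch (my toy example, not from the post): for the family $\mathcal{N}(0, \sigma^2)$ with parametrization $\sigma > 0$, the Fisher metric is $F(\sigma) = 2/\sigma^2$ and the exponential map has the closed form $\mathrm{Exp}_\sigma(v) = \sigma e^{v/\sigma}$, so Riemannian gradient descent can be run exactly:

```python
import numpy as np

# A sketch (assumed example): Gaussian family N(0, sigma^2) with parametrization
# sigma > 0. Fisher metric: F(sigma) = 2 / sigma^2. The geodesic has a closed
# form here, giving Exp_sigma(v) = sigma * exp(v / sigma).

def nll_grad(sigma, y2_mean):
    # Euclidean gradient of the average NLL: d/dsigma [log sigma + y2_mean/(2 sigma^2)]
    return 1.0 / sigma - y2_mean / sigma**3

def rgd_step(sigma, t, y2_mean):
    # Riemannian GD: sigma_{k+1} = Exp_sigma(-t * F^{-1} grad)
    nat_grad = (sigma**2 / 2.0) * nll_grad(sigma, y2_mean)
    return sigma * np.exp(-t * nat_grad / sigma)

sigma, y2_mean = 1.0, 4.0           # optimum is sigma = 2
for _ in range(20):
    sigma = rgd_step(sigma, 1.0, y2_mean)
print(sigma)                        # converges toward 2; iterates never leave sigma > 0
```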
Many faces of Natural-gradient Descent
Natural-gradient Descent as Inexact Riemannian Gradient Descent
Natural-gradient descent can be derived from a first-order (linear) approximation of the geodesic, which implies that natural-gradient descent is indeed an inexact Riemannian gradient update. Because the approximation is linear, natural-gradient descent is only linearly invariant.
Consider a first-order Taylor approximation of the geodesic $\gamma(t)$ at $t = 0$:
$$\gamma(t) \approx \gamma(0) + t\, \dot{\gamma}(0) = \tau_0 + t\, \hat{\mathbf{v}}.$$
Warning:
This approximation does not guarantee that the approximated geodesic stays on the manifold for all $t$.
Recall that the exponential map is defined via the geodesic as $\mathrm{Exp}_{\tau_0}(\hat{\mathbf{v}}) = \gamma(1)$. The approximation induces a new (linearized) map $\widetilde{\mathrm{Exp}}_{\tau_0}(\hat{\mathbf{v}}) := \tau_0 + \hat{\mathbf{v}}$.
Therefore, an inexact Riemannian gradient update with this new map is defined as
$$\tau_{k+1} \leftarrow \widetilde{\mathrm{Exp}}_{\tau_k}\big(-t\, \hat{\mathbf{g}}_{\tau_k}\big) = \tau_k - t\, \mathbf{F}^{-1}(\tau_k)\, \nabla_\tau \ell(\tau_k),$$
which is exactly natural-gradient descent.
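Continuing the Gaussian-scale sketch from above (my example), the linearization is easy to see: $\mathrm{Exp}_\sigma(v) = \sigma e^{v/\sigma} \approx \sigma + v$ for small $v$, so one NGD step is a first-order approximation of one Riemannian step. A tiny numerical comparison:

```python
import numpy as np

sigma, v = 2.0, -0.1               # current point and a small Riemannian step
exact = sigma * np.exp(v / sigma)  # exponential-map (Riemannian GD) update
approx = sigma + v                 # linearized update = natural-gradient descent
print(exact, approx)               # 1.9024... vs 1.9: close for small steps

v = -3.0                           # a large step
print(sigma * np.exp(v / sigma), sigma + v)   # 0.446 > 0 vs -1.0 (leaves sigma > 0)
```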
Natural-gradient Descent as Unconstrained Proximal-gradient Descent
In this section, we will make an additional assumption:
Additional assumption:
The parameter space $\Omega_\tau$ is unconstrained, i.e., $\Omega_\tau = \mathbb{R}^K$.
As we mentioned before, the distances in the gradient space and the parameter space are defined differently. In the previous section, we used the geodesic to define the distance between two points in a parameter space (this geodesic distance induced by the FIM is known as the Rao distance [2]).
We could also use other “distances”, denoted by $\mathrm{D}(\cdot, \cdot)$, to measure the closeness between two points in a parameter space.
In the following section, we assume the manifold is a parametric family of distributions $\{p(x;\tau)\}$.
We define a class of f-divergences in this case. Note that an f-divergence can be defined in a more general setting.
Csiszár f-divergence
Let $f : (0, \infty) \to \mathbb{R}$ be a function that is
- convex in its domain,
- normalized so that $f(1) = 0$.
An f-divergence for a parametric family $\{p(x;\tau)\}$ is defined as
$$\mathrm{D}_f(\tau_1, \tau_2) := \int p(x;\tau_2)\, f\!\Big(\frac{p(x;\tau_1)}{p(x;\tau_2)}\Big)\, dx.$$
The KL divergence is an f-divergence, where $f(t) = t \log t$ gives $\mathrm{D}_f(\tau_1, \tau_2) = \mathrm{KL}\big(p(x;\tau_1)\,\|\,p(x;\tau_2)\big)$.
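As a quick sanity check (my own toy computation), the f-divergence with $f(t) = t \log t$ indeed matches the KL divergence for two Bernoulli distributions:

```python
import numpy as np

def f(t):
    return t * np.log(t)

def f_divergence(p, q):
    # D_f(p, q) = sum_x q(x) f(p(x)/q(x)) for discrete distributions
    return np.sum(q * f(p / q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.3, 0.7])   # Bernoulli(0.3)
q = np.array([0.5, 0.5])   # Bernoulli(0.5)
print(f_divergence(p, q), kl(p, q))   # identical values
```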
Proximal-gradient descent
Given such a “distance” $\mathrm{D}(\cdot,\cdot)$, proximal-gradient descent performs the update [3]
$$\tau_{k+1} \leftarrow \arg\min_{\tau}\; \big\langle \nabla_\tau \ell(\tau_k),\, \tau \big\rangle + \frac{1}{t}\, \mathrm{D}(\tau, \tau_k),$$
where $t > 0$ is a step-size.
Claim:
The second-order Taylor approximation of any f-divergence $\mathrm{D}_f(\tau, \tau_k)$ at $\tau = \tau_k$ (assuming $f$ is twice differentiable at $1$) is a quadratic form induced by the FIM:
$$\mathrm{D}_f(\tau, \tau_k) \approx \frac{f''(1)}{2}\, (\tau - \tau_k)^T\, \mathbf{F}(\tau_k)\, (\tau - \tau_k).$$
Proof of the claim:
By the claim, when we replace the f-divergence $\mathrm{D}_f(\tau, \tau_k)$ in the proximal step with its second-order approximation, the sub-problem becomes
$$\tau_{k+1} \leftarrow \arg\min_{\tau}\; \big\langle \nabla_\tau \ell(\tau_k),\, \tau \big\rangle + \frac{f''(1)}{2t}\, (\tau - \tau_k)^T\, \mathbf{F}(\tau_k)\, (\tau - \tau_k).$$
It is easy to see that we can obtain the following natural-gradient update from the above expression by setting its gradient w.r.t. $\tau$ to zero:
$$\tau_{k+1} = \tau_k - \frac{t}{f''(1)}\, \mathbf{F}^{-1}(\tau_k)\, \nabla_\tau \ell(\tau_k).$$
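We can also support the claim numerically (again a toy sketch of my own): for a Bernoulli family, $f(t) = t \log t$ gives $f''(1) = 1$, and the KL divergence between nearby parameters matches the FIM quadratic:

```python
import numpy as np

def kl_bern(m1, m2):
    # KL( Bern(m1) || Bern(m2) )
    return m1 * np.log(m1 / m2) + (1 - m1) * np.log((1 - m1) / (1 - m2))

mu, delta = 0.3, 1e-3
fim = 1.0 / (mu * (1.0 - mu))   # Fisher information of Bernoulli(mu)
print(kl_bern(mu + delta, mu))  # ~ 2.38e-06
print(0.5 * fim * delta**2)     # ~ 2.38e-06: matches to second order
```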
Note
The Taylor expansion
- uses the distance defined in a (Riemannian) gradient space to approximate the distance in a parameter space.
- implicitly assumes that the parameter space is open due to the definition of the expansion.
In Part V, we will show that natural-gradient descent in the natural parametrization of an exponential family can be viewed as a mirror descent update.
Warning:
The connection between natural-gradient descent and proximal-gradient/mirror descent could break down in constrained cases unless additional assumptions are imposed on the parameter space.
Natural-gradient Descent in Non-intrinsic Parameter Spaces
As mentioned in Part I, an intrinsic parametrization creates a nice parameter space (e.g., a local vector space structure) and guarantees a non-singular FIM. We now discuss issues when it comes to natural-gradient descent over non-intrinsic parametrizations including overparameterization.
- We may not have a local vector space structure in a non-intrinsic parameter space. Therefore, natural-gradient descent in such a parameter space is problematic since the updated parameter can leave the parameter space. Indeed, the FIM could also be ill-defined in such cases. We will illustrate this with examples.
Bernoulli Example: Invalid NGD
Von Mises–Fisher Example: Invalid NGD
- The FIM is singular in a non-intrinsic parameter space. In theory, the Moore–Penrose inverse could be used to compute natural gradients so that natural-gradient descent remains linearly invariant in this case. However, the Moore–Penrose inverse often has to be computed via the singular value decomposition (SVD), which destroys structure in the FIM. In practice, the iteration cost of the Moore–Penrose inverse is very high, as illustrated in the following example and in the sketch after this list.
Example: High iteration cost
- It is tempting to approximate the singular FIM by an empirical FIM with a scalar damping term and use the Woodbury matrix identity to reduce the iteration cost of computing natural gradients. However, sample-based empirical approximations could be problematic [4]. Moreover, damping introduces an additional tuning hyperparameter and destroys the linear invariance of natural-gradient descent. Such an approximation should be used with caution.
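To illustrate the singular-FIM point above, here is a minimal sketch (my toy example, assuming the overparameterized Bernoulli model $p(x=1) = \theta_1/(\theta_1+\theta_2)$, which has one effective degree of freedom but two parameters):

```python
import numpy as np

# Overparameterized Bernoulli: p(x=1) = a/(a+b), parameters theta = (a, b).
def score(x, a, b):
    s = a + b
    if x == 1:
        return np.array([1.0/a - 1.0/s, -1.0/s])
    return np.array([-1.0/s, 1.0/b - 1.0/s])

def fim(a, b):
    # FIM = E[ score score^T ] under the model
    p1 = a / (a + b)
    return (p1 * np.outer(score(1, a, b), score(1, a, b))
            + (1 - p1) * np.outer(score(0, a, b), score(0, a, b)))

F = fim(2.0, 3.0)
print(np.linalg.matrix_rank(F))   # 1: the FIM is singular (rank-deficient)
print(np.linalg.pinv(F))          # Moore-Penrose inverse via SVD still works,
                                  # but SVD is expensive for large, structured FIMs
```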
Euclidean Gradient Descent is NOT (Linearly) Invariant
For simplicity, consider an unconstrained optimization problem.
Euclidean gradient descent (GD) in parametrization $\tau$ is
$$\tau_{k+1} \leftarrow \tau_k - t\, \nabla_\tau \ell(\tau_k).$$
Consider a reparametrization $\lambda = u(\tau) := \mathbf{U}\tau$, where $\mathbf{U}$ is a constant invertible matrix.
The Euclidean gradient descent (GD) in parametrization $\lambda$ is
$$\lambda_{k+1} \leftarrow \lambda_k - t\, \nabla_\lambda \ell(\lambda_k).$$
Note that Euclidean gradients follow the transformation rule as
$$\nabla_\tau \ell = \mathbf{U}^T \nabla_\lambda \ell \iff \nabla_\lambda \ell = \mathbf{U}^{-T} \nabla_\tau \ell.$$
We can verify that
$$\lambda_{k+1} = \mathbf{U}\tau_k - t\, \mathbf{U}^{-T}\, \nabla_\tau \ell(\tau_k).$$
Notice that
$$u(\tau_{k+1}) = \mathbf{U}\tau_k - t\, \mathbf{U}\, \nabla_\tau \ell(\tau_k).$$
It is easy to see that $\lambda_{k+1} \neq u(\tau_{k+1})$ in general since $\mathbf{U}^{-T} \neq \mathbf{U}$ unless $\mathbf{U}$ is orthogonal. Thus, GD
updates in $\tau$ and $\lambda$ do not represent the same point on the manifold: Euclidean gradient descent is not (linearly) invariant.
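A quick numerical confirmation (my own toy setup: a quadratic objective and a non-orthogonal invertible $\mathbf{U}$):

```python
import numpy as np

# Check (non-)invariance of plain GD under a linear reparametrization.
def grad_tau(tau):
    # Euclidean gradient of ell(tau) = 0.5 * (tau[0]^2 + 10 * tau[1]^2)
    return np.array([tau[0], 10.0 * tau[1]])

U = np.array([[2.0, 1.0], [0.0, 1.0]])   # invertible, not orthogonal
U_inv = np.linalg.inv(U)

tau0 = np.array([1.0, 1.0])
t = 0.1

tau1 = tau0 - t * grad_tau(tau0)                      # GD step in tau
lam0 = U @ tau0
lam1 = lam0 - t * (U_inv.T @ grad_tau(U_inv @ lam0))  # GD step in lambda

print(U @ tau1)   # [1.8  0. ]      <- tau-step mapped to lambda
print(lam1)       # [2.95 0.05]     <- different point: GD is not invariant
```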
Newton’s Method is Linearly Invariant
For simplicity, consider an unconstrained convex optimization problem.
Newton’s method under parametrization $\tau$ is
$$\tau_{k+1} \leftarrow \tau_k - t\, \big[\nabla^2_\tau \ell(\tau_k)\big]^{-1}\, \nabla_\tau \ell(\tau_k).$$
Consider a reparametrization $\lambda = u(\tau) := \mathbf{U}\tau$, where $\mathbf{U}$ is a constant invertible matrix.
Newton’s method under parametrization $\lambda$ is
$$\lambda_{k+1} \leftarrow \lambda_k - t\, \big[\nabla^2_\lambda \ell(\lambda_k)\big]^{-1}\, \nabla_\lambda \ell(\lambda_k).$$
As we discussed in the previous section,
Euclidean gradients follow the
transformation rule as $\nabla_\tau \ell = \mathbf{U}^T \nabla_\lambda \ell$.
Surprisingly, for a linear transformation, the Hessian follows the same transformation rule as the FIM:
$$\nabla^2_\tau \ell = \mathbf{U}^T \big[\nabla^2_\lambda \ell\big]\, \mathbf{U}.$$
Therefore, the direction in Newton’s method, denoted by $\mathbf{d} := [\nabla^2 \ell]^{-1}\, \nabla \ell$, follows the same transformation rule as a natural gradient: $\mathbf{d}_\lambda = \mathbf{U}\, \mathbf{d}_\tau$.
The consequence is that Newton’s method, like natural-gradient descent, is linearly invariant.
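The same kind of numerical check, now for Newton's method (my toy setup: a smooth convex objective and the same style of linear map), confirms the invariance:

```python
import numpy as np

# Newton's method gives the same update under lambda = U @ tau.
def grad_tau(tau):
    # gradient of ell(tau) = exp(tau[0]) - tau[0] + tau[1]^2 (convex)
    return np.array([np.exp(tau[0]) - 1.0, 2.0 * tau[1]])

def hess_tau(tau):
    return np.diag([np.exp(tau[0]), 2.0])

U = np.array([[2.0, 1.0], [0.0, 1.0]])
U_inv = np.linalg.inv(U)

tau0 = np.array([0.5, -0.3])
t = 1.0

# Newton step in tau, mapped to lambda
d_tau = np.linalg.solve(hess_tau(tau0), grad_tau(tau0))
print(U @ (tau0 - t * d_tau))

# Newton step directly in lambda, using the linear transformation rules:
# grad_lam = U^{-T} grad_tau,  hess_lam = U^{-T} hess_tau U^{-1}
lam0 = U @ tau0
g_lam = U_inv.T @ grad_tau(tau0)
H_lam = U_inv.T @ hess_tau(tau0) @ U_inv
print(lam0 - t * np.linalg.solve(H_lam, g_lam))   # identical point
```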
The Hessian is not a valid manifold metric
The Hessian $\nabla^2_\tau \ell$ does not follow this transformation rule under a non-linear transformation³. Moreover, a Hessian can be singular or indefinite, while a metric must be positive-definite.
In contrast, the FIM is a valid manifold metric. Recall that the FIM can also be computed
as
$$\mathbf{F}(\tau) = \mathbb{E}_{p(x;\tau)}\big[\nabla_\tau \log p(x;\tau)\, \nabla_\tau \log p(x;\tau)^T\big] = -\mathbb{E}_{p(x;\tau)}\big[\nabla^2_\tau \log p(x;\tau)\big].$$
Claim:
The FIM follows the transformation rule $\mathbf{F}_\tau = \mathbf{J}^T\, \mathbf{F}_\lambda\, \mathbf{J}$, with $\mathbf{J} = \frac{\partial \lambda}{\partial \tau}$, even in non-linear cases.
Proof of the claim
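Since the expanded proof is collapsed above, here is a short sketch in my own notation. The chain rule applied to the score gives $\nabla_\tau \log p(x;\tau) = \mathbf{J}^T\, \nabla_\lambda \log p(x;\lambda)$, even when the map is non-linear. Therefore,
$$\mathbf{F}_\tau = \mathbb{E}\big[\nabla_\tau \log p\, \nabla_\tau \log p^T\big] = \mathbf{J}^T\, \mathbb{E}\big[\nabla_\lambda \log p\, \nabla_\lambda \log p^T\big]\, \mathbf{J} = \mathbf{J}^T\, \mathbf{F}_\lambda\, \mathbf{J}.$$
For a generic objective $\ell$, the analogous computation for the Hessian picks up an extra term involving the second derivatives of the map (see footnote 3); for the FIM in its Hessian form, this extra term vanishes because the expected score is zero.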
We will discuss in Part V that, for a special parametrization of an exponential family (its natural parametrization), the FIM is the Hessian of a convex function.
In the literature, the exponential family with a natural parametrization is known as a Hessian manifold [5], where the FIM under this kind of parametrization is called a Hessian metric. However, a non-linear reparametrization will lead to a non-natural parametrization.
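As a concrete instance (my own addition, stating a standard fact): for an exponential family
$$p(x;\eta) = h(x)\exp\big(\langle \eta, T(x)\rangle - A(\eta)\big),$$
the FIM under the natural parametrization $\eta$ is the Hessian of the (convex) log-partition function,
$$\mathbf{F}(\eta) = \nabla^2_\eta A(\eta),$$
which is why this metric is called a Hessian metric.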
References
[1] S.-I. Amari, "Natural gradient works efficiently in learning," Neural computation 10:251–276 (1998).
[2] C. Atkinson & A. F. S. Mitchell, "Rao’s distance measure," Sankhyā: The Indian Journal of Statistics, Series A 345–365 (1981).
[3] M. E. Khan, R. Babanezhad, W. Lin, M. Schmidt, & M. Sugiyama, "Faster stochastic variational inference using Proximal-Gradient methods with general divergence functions," Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence (2016), pp. 319–328.
[4] F. Kunstner, P. Hennig, & L. Balles, "Limitations of the empirical Fisher approximation for natural gradient descent," Advances in Neural Information Processing Systems 32:4156–4167 (2019).
[5] H. Shima, The geometry of Hessian structures (World Scientific, 2007).
Footnotes:
- Informally, we locally identify the parameter space with the tangent vector space. However, not all properties are the same in these two spaces. For example, the weighted inner product induced by the Riemannian metric is only defined in the vector space. We often do not use the same weighted inner product in the parameter space. ↩
- In Riemannian geometry, a geodesic is induced by the Levi-Civita connection. This connection induces the metric-compatible parallel transport. Christoffel symbols are used to represent the connection in a coordinate/parametrization system. ↩
- This is due to the non-zero partial derivatives of the Jacobian matrix when it comes to non-linear parameter transformations. These partial derivatives are also related to Christoffel symbols or Levi-Civita connection coefficients. ↩