Part IV: Natural Gradient Descent and its Extension—Riemannian Gradient Descent

19 minute read

Warning: work in progress (incomplete)

Goal (edited: 07-Jan-23)

This blog post should help readers to understand natural-gradient descent and Riemannian gradient descent. We also discuss some invariance properties of natural-gradient descent, Riemannian gradient descent, and Newton’s method.

We will give an informal introduction with a focus on high-level ideas.

Click to see how to cite this blog post
@misc{lin2021NGDblog04,
  title = {Introduction to Natural-gradient Descent: Part IV},
  author = {Lin, Wu and Nielsen, Frank and Khan, Mohammad Emtiyaz and Schmidt, Mark},
  url = {https://informationgeometryml.github.io/year-archive/}, 
  howpublished = {\url{https://informationgeometryml.github.io/posts/2021/11/Geomopt04/}},
  year = {2021},
  note = {Accessed: 2021-11-15}
}

Two kinds of Spaces


We will discuss (Riemannian) gradient spaces and parameter spaces for gradient-based updates.

As we discussed in Part II, the parameter space $\Omega_\tau$ and the tangent space $\mathcal{T}_{\tau_0}\mathcal{M}$ at point $\tau_0$ are two distinct spaces. Given an intrinsic parametrization $\tau$, the tangent space is a (complete) vector space with $\mathcal{T}_{\tau_0}\mathcal{M} = \mathbb{R}^K$, while the parameter space $\Omega_\tau$ is like a local vector space in $\mathbb{R}^K$, where $K$ is the dimension of the manifold. In other words, $\Omega_\tau$ is often an open (proper) subset of $\mathbb{R}^K$.

In manifold cases, we have to explicitly distinguish between the representation of a point (a parameter) and that of a vector (a Riemannian gradient). These two spaces differ in many aspects, such as their domains and their notions of distance. Mathematically speaking, a (Riemannian) gradient space is much simpler and nicer than a parameter space.

Natural-gradient Descent in an Intrinsic Parameter Space


Using an intrinsic parametrization $\tau$, we can perform a natural-gradient update known as natural-gradient descent (NGD) [1] when we use the Fisher-Rao metric.

$$\text{NGD}: \quad \tau_{k+1} \leftarrow \tau_k - \alpha\, \hat{g}_{\tau_k} \tag{1}$$

where $\hat{g}_{\tau_k} := v(\tau_k)$ is a natural gradient evaluated at point $\tau_k$ and $\alpha := t > 0$ is a step-size.

This update in Eq. (1) is inspired by the standard linear update in the tangent vector space (see the green space in the above figure). By choosing an intrinsic parametrization, this update is also valid in the parameter space1 (see the red space in the above figure) as long as the step-size is small enough.

The update in the parameter space is valid since the parameter space $\Omega_\tau$ has a local vector-space structure thanks to the use of an intrinsic parametrization. However, when $\Omega_\tau$ is a proper subset of $\mathcal{T}_{\tau_k}\mathcal{M}$ (i.e., $\Omega_\tau \subsetneq \mathcal{T}_{\tau_k}\mathcal{M}$), the update in Eq. (1) is valid only when the step-size $\alpha$ is small enough so that $\tau_{k+1} \in \Omega_\tau$.

Warning:

Using a small step-size could be an issue since it could greatly slow down the progression of natural-gradient descent in practice.

Example of NGD in a constrained space: (click to expand)

Consider a univariate Gaussian family $\{\mathcal{N}(w|\mu, v) \,|\, \mu \in \mathbb{R},\, v > 0\}$ with mean $\mu$, variance $v$, and intrinsic parameter $\tau = (\mu, v)$.

We have to properly select the step-size $\alpha$ for natural-gradient descent in (1) due to the positivity constraint on $v$.

In multivariate Gaussian cases, we have to handle a positive-definite constraint.
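Below is a minimal numerical sketch (not from the original post) of one NGD step for this univariate Gaussian family; the toy objective $h(\tau) = \mathrm{KL}(\mathcal{N}(\mu,v)\,\|\,\mathcal{N}(0,1))$ and the step-sizes are my own assumptions for illustration. It shows how a step-size that is too large pushes the variance out of $\Omega_\tau$.

```python
# A minimal numerical sketch of NGD (Eq. 1) for the univariate Gaussian family
# N(w | mu, v) with intrinsic parameter tau = (mu, v).  The toy objective
# h(tau) = KL( N(mu, v) || N(0, 1) ) is an assumption made for illustration.
import numpy as np

def euclidean_grad(mu, v):
    # h(mu, v) = 0.5 * (v + mu**2 - 1 - log v)
    return np.array([mu, 0.5 * (1.0 - 1.0 / v)])

def fisher(mu, v):
    # Fisher information matrix of N(mu, v) under tau = (mu, v)
    return np.diag([1.0 / v, 1.0 / (2.0 * v ** 2)])

def ngd_step(tau, alpha):
    mu, v = tau
    g_hat = np.linalg.solve(fisher(mu, v), euclidean_grad(mu, v))  # natural gradient
    return tau - alpha * g_hat

tau = np.array([2.0, 3.0])
for alpha in (1.0, 0.1):
    new_tau = ngd_step(tau, alpha)
    print(f"alpha={alpha}: tau_1={new_tau}, stays in the space: {new_tau[1] > 0}")
# A large step-size (alpha=1.0) pushes the variance below zero, i.e. tau_1 leaves
# Omega_tau, while a small enough step-size keeps the update valid.
```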

Natural-gradient Descent is Linearly Invariant


Recall that in Part III, we have shown that natural gradients are invariant under any intrinsic parameter transformation. The parameter transformation can be non-linear.

It is natural to expect that natural-gradient descent has a similar property. Unfortunately, natural-gradient descent is only invariant under an intrinsic linear transformation. Note that Newton’s method is also linearly invariant while Euclidean gradient descent is not.

Let’s consider the following (scalar) optimization problem on a smooth manifold $\mathcal{M}$ with the Fisher-Rao metric $F$. $$\min_{x \in \mathcal{M}} h(x) \tag{2}$$

Note that $\mathcal{M}$ in general does not have a vector-space structure. We have to artificially choose an intrinsic parameterization $\tau$ so that the parameter space $\Omega_\tau$ at least has a local vector-space structure. The problem in (2) can be re-expressed as below. $$\min_{\tau \in \Omega_\tau} h_\tau(\tau)$$ where $h_\tau$ is the parameter representation of the smooth function $h$.

Natural-gradient descent in this parameter space $\Omega_\tau$ is $$\tau_{k+1} \leftarrow \tau_k - \alpha\, \hat{g}_{\tau_k} \tag{3}$$ where $F_\tau$ is the FIM, the natural gradient is $\hat{g}_{\tau_k} := [F_\tau(\tau_k)]^{-1} \nabla_\tau h_\tau(\tau_k)$, and the step-size $\alpha$ is small enough so that $\tau_{k+1} \in \Omega_\tau$.

Consider another intrinsic parameterization $\lambda$ so that $\lambda = U\tau$, where $U$ is a constant (square) invertible matrix. For simplicity, we further assume $\Omega_\lambda = \{U\tau \,|\, \tau \in \Omega_\tau\}$, where $\Omega_\lambda$ is the parameter space of $\lambda$.

Natural-gradient descent in this parameter space $\Omega_\lambda$ is $$\lambda_{k+1} \leftarrow \lambda_k - \alpha\, \hat{g}_{\lambda_k} \tag{4}$$ where $\hat{g}_{\lambda_k} := [F_\lambda(\lambda_k)]^{-1} \nabla_\lambda h_\lambda(\lambda_k)$ and the cost function is $h_\lambda(\lambda) = h_\tau(\tau(\lambda))$.

Recall that we have the transformation rule for natural gradients: $\hat{g}_\tau = Q\, \hat{g}_\lambda$, where $Q^j_i = \frac{\partial \tau^j(\lambda)}{\partial \lambda^i}$.

We can verify that $Q = U^{-1}$. Notice that $\tau_0 = U^{-1}\lambda_0$ by construction. The update in (3) at iteration $k=1$ can then be re-expressed as $$\tau_1 \leftarrow \tau_0 - \alpha\, \hat{g}_{\tau_0} = U^{-1}\lambda_0 - \alpha\, U^{-1}\hat{g}_{\lambda_0} = U^{-1}\lambda_1$$

When $\alpha$ is small enough, we have $\tau_1 \in \Omega_\tau$ and $\lambda_1 \in \Omega_\lambda$. It is easy to show that $\tau_k = U^{-1}\lambda_k$ by induction. Therefore, the updates in (3) and (4) are equivalent.
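As a sanity check, the following sketch (my own, for illustration) runs one NGD step in the $\tau$-space and one in the $\lambda$-space for the univariate Gaussian family with a hand-picked invertible $U$, using the transformation rules $g_\lambda = U^{-T} g_\tau$ and $F_\lambda = U^{-T} F_\tau U^{-1}$, and confirms that the two updates land on the same point.

```python
# A small numerical check (using the same toy Gaussian objective as in the sketch
# above, which is an assumption for illustration) that one NGD step in tau-space and
# one NGD step in lambda-space with lambda = U tau land on the same point.
import numpy as np

def euclidean_grad_tau(tau):
    mu, v = tau
    return np.array([mu, 0.5 * (1.0 - 1.0 / v)])

def fisher_tau(tau):
    mu, v = tau
    return np.diag([1.0 / v, 1.0 / (2.0 * v ** 2)])

U = np.array([[2.0, 1.0],
              [0.0, 3.0]])        # a fixed invertible matrix (my choice)
U_inv = np.linalg.inv(U)
alpha = 0.1
tau0 = np.array([2.0, 3.0])
lam0 = U @ tau0

# Update (3) in tau-space.
g_hat_tau = np.linalg.solve(fisher_tau(tau0), euclidean_grad_tau(tau0))
tau1 = tau0 - alpha * g_hat_tau

# Update (4) in lambda-space: gradients and the FIM follow their transformation rules
# g_lambda = U^{-T} g_tau and F_lambda = U^{-T} F_tau U^{-1}.
tau_of_lam = U_inv @ lam0
g_lam = U_inv.T @ euclidean_grad_tau(tau_of_lam)
F_lam = U_inv.T @ fisher_tau(tau_of_lam) @ U_inv
lam1 = lam0 - alpha * np.linalg.solve(F_lam, g_lam)

print(np.allclose(tau1, U_inv @ lam1))   # True: the two updates are equivalent
```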

Riemannian Gradient Descent and its (Non-linear) Invariance


Now we discuss a gradient-based method that is invariant to any intrinsic parameter transformation.

As mentioned before, a manifold in general does not have a vector-space structure. We have to artificially choose an intrinsic parameterization $\tau$, which introduces a dependence on the parametrization. Therefore, it would be ideal if a gradient-based method did not depend on the choice of intrinsic parametrization.

We will first introduce the concept of a (one-dimensional) geodesic $\gamma(t)$, which is the “shortest curve” on a manifold equipped with a Riemannian metric (e.g., the Fisher-Rao metric). Recall that in Part II we only defined a distance between two Riemannian gradients evaluated at the same point. We can use the length of the shortest geodesic to define the distance between two points on the manifold. In statistics, the distance induced by a geodesic under the Fisher-Rao metric is known as the Rao distance [2]. Finding the geodesic that connects two given points is known as the boundary value problem (BVP) of a geodesic: the boundary conditions are specified by a starting point and an end point on the manifold. To solve this problem, we often solve an easier problem known as the initial value problem (IVP) of a geodesic, where the initial conditions are specified by a starting point and an initial Riemannian gradient/velocity.

We will only consider the IVP of a geodesic for simplicity. Consider an intrinsic parametrization $\tau$, where $\gamma_\tau(t)$ is the parameter representation of the geodesic. To specify a geodesic, we need to provide a starting point $\tau_0$ on the manifold and a Riemannian gradient $v_{\tau_0}$ evaluated at point $\tau_0$. The geodesic is the solution of a system of second-order non-linear ordinary differential equations (ODEs) with the following initial conditions: $$\gamma_\tau(0) = \tau_0; \quad \frac{d\gamma_\tau(t)}{dt}\Big|_{t=0} = v_{\tau_0}$$ where the geodesic is uniquely determined by the initial conditions and $I_\tau$ is the domain of the parametric geodesic curve. We assume 0 is contained in the domain for simplicity.

For a geodesically complete manifold, the domain of the geodesic is the whole real line, $I_\tau = \mathbb{R}$. We will only consider this case in this post.

To avoid explicitly writing down the differential equations of a geodesic2 (i.e., Christoffel symbols or the connection 1-form in a section), researchers refer to it as the manifold exponential map. Given an intrinsic parametrization $\tau$, the map at point $\tau_0$ is defined as $$\mathrm{Exp}_{\tau_0}: \mathcal{T}_{\tau_0}\mathcal{M} \to \mathcal{M}, \quad v_{\tau_0} \mapsto \gamma_\tau(1) \;\; \text{s.t.} \;\; \gamma_\tau(0) = \tau_0,\; \frac{d\gamma_\tau(t)}{dt}\Big|_{t=0} = v_{\tau_0}$$ where $t$ is fixed to be 1 and $\tau_0$ denotes the initial point under parametrization $\tau$.
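As a concrete toy case (my own choice, not from the post), the univariate Gaussian family $\mathcal{N}(w|0,v)$ with fixed mean and intrinsic parameter $\tau = v > 0$ has Fisher-Rao metric $F(v) = 1/(2v^2)$, and its geodesic IVP admits the closed form $\gamma(t) = v_0 \exp(t\,u/v_0)$, so $\mathrm{Exp}_{v_0}(u) = v_0 \exp(u/v_0)$. The sketch below checks this closed form against a numerical integration of the geodesic ODE and contrasts it with the linear move $v_0 + u$, which can leave the space.

```python
# A sketch of the manifold exponential map for a toy one-dimensional family: a
# univariate Gaussian N(w | 0, v) with fixed mean, intrinsic parameter tau = v > 0,
# and Fisher-Rao metric F(v) = 1 / (2 v^2).  For this metric the geodesic ODE
# v'' - (v')^2 / v = 0 has the closed-form solution v(t) = v0 * exp(t * u / v0),
# so Exp_{v0}(u) = v0 * exp(u / v0).  (The choice of family is mine, for illustration.)
import numpy as np

def exp_map(v0, u):
    return v0 * np.exp(u / v0)

def geodesic_ivp_rk4(v0, u, t_end=1.0, n_steps=1000):
    # Numerically integrate the geodesic ODE as a first-order system to check Exp.
    def rhs(state):
        v, vdot = state
        return np.array([vdot, vdot ** 2 / v])
    state = np.array([v0, u])
    h = t_end / n_steps
    for _ in range(n_steps):
        k1 = rhs(state)
        k2 = rhs(state + 0.5 * h * k1)
        k3 = rhs(state + 0.5 * h * k2)
        k4 = rhs(state + h * k3)
        state = state + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return state[0]

v0, u = 2.0, -5.0
print(exp_map(v0, u))            # closed-form Exp_{v0}(u); always positive
print(geodesic_ivp_rk4(v0, u))   # numerical solution of the IVP agrees
print(v0 + u)                    # the linear move tau_0 + u leaves the space (v <= 0)
```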

Warning:

The exponential map is a geometric object. We use a parametric representation of a geodesic curve to define the exponential map. Thus, the form/expression of the map does depend on the choice of parametrizations.

Now, we can define a Riemannian gradient method. Under intrinsic parametrization τ of a manifold, (exact) Riemannian gradient descent (RGD) is defined as

$$\text{RGD}: \quad \tau_{k+1} \leftarrow \mathrm{Exp}_{\tau_k}\!\big(-\alpha\, \hat{g}_{\tau_k}\big)$$

where $\tau_{k+1}$ always stays on the manifold thanks to the exponential map, since $-\alpha\,\hat{g}_{\tau_k}$ is in the domain of the exponential map.

The invariance of this update is due to the uniqueness of the geodesic and the invariance of the Euler-Lagrange equation. We will not discuss this further in this post to avoid complicated derivations. We will cover this in future posts about calculus of variations.

Although Riemannian gradient descent is nice, the exponential map or the geodesic often has a high computational cost and does not admit a closed-form expression.

Many faces of Natural-gradient Descent


Natural-gradient Descent as Inexact Riemannian Gradient Descent

Natural-gradient descent can be derived from a first-order (linear) approximation of the geodesic, which implies that natural-gradient descent is indeed an inexact Riemannian gradient update. Due to this approximation, natural-gradient descent is only linearly invariant.

Consider a first-order Taylor approximation of the geodesic at $t=0$, shown below. $$\gamma_\tau(t) \approx \gamma_\tau(0) + \frac{d\gamma_\tau(t)}{dt}\Big|_{t=0}(t - 0)$$

Warning:

This approximation does not guarantee that the approximated geodesic stays on the manifold for $t \neq 0$.

Recall that the exponential map is defined via the geodesic $\gamma_\tau(1)$. We can use this approximation of the geodesic to define a new map (a.k.a. the Euclidean retraction map) as shown below. $$\mathrm{Ret}_{\tau_0}(v_{\tau_0}) := \gamma_\tau(0) + \frac{d\gamma_\tau(t)}{dt}\Big|_{t=0}(1 - 0) = \tau_0 + v_{\tau_0}$$

Therefore, an inexact Riemannian gradient update with this new map is defined as $$\tau_{k+1} \leftarrow \mathrm{Ret}_{\tau_k}\!\big(-\alpha\, \hat{g}_{\tau_k}\big) = \tau_k - \alpha\, \hat{g}_{\tau_k}$$ which recovers natural-gradient descent.
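To make the contrast concrete, the sketch below uses the same toy fixed-mean Gaussian family as in the earlier geodesic example, with the assumed objective $h(v) = \mathrm{KL}(\mathcal{N}(0,v)\,\|\,\mathcal{N}(0,1))$ (my choices, for illustration), and compares one exact RGD step via the exponential map with one inexact step via the Euclidean retraction, i.e., NGD.

```python
# Comparing one exact RGD step (via the exponential map) with one inexact step
# (via the Euclidean retraction, i.e. NGD), for the same toy fixed-mean Gaussian
# family as above with metric F(v) = 1/(2 v^2).  The objective
# h(v) = KL(N(0, v) || N(0, 1)) = 0.5 * (v - 1 - log v) is an assumption for
# illustration.
import numpy as np

def natural_grad(v):
    euclidean_grad = 0.5 * (1.0 - 1.0 / v)
    return (2.0 * v ** 2) * euclidean_grad        # F(v)^{-1} * grad

def rgd_step(v, alpha):
    u = -alpha * natural_grad(v)
    return v * np.exp(u / v)                      # Exp_v(-alpha * g_hat)

def ngd_step(v, alpha):
    return v - alpha * natural_grad(v)            # Ret_v(-alpha * g_hat)

v, alpha = 4.0, 0.5
print(rgd_step(v, alpha))   # stays positive for any step-size
print(ngd_step(v, alpha))   # the retraction can leave the space: 4 - 0.5*12 = -2
```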

Natural-gradient Descent as Unconstrained Proximal-gradient Descent

In this section, we will make an additional assumption:

Additional assumption:

The parameter space $\Omega_\tau = \mathbb{R}^K$ is unconstrained.

As we mentioned before, the distances in the gradient space and the parameter space are defined differently. In the previous section, we use the geodesic to define the distance between two points in a parameter space.

We could also use other “distances”, denoted by $D(\cdot,\cdot)$ (e.g., the Kullback–Leibler divergence) [3], to measure the discrepancy between two points in a parameter space.

In the following section, we assume $p(w|y)$ and $p(w|\tau_k)$ are two members of a parametric distribution family $p(w|\tau)$, indexed by $y$ and $\tau_k$, respectively.

We define the class of f-divergences for such a parametric family. Note that an f-divergence can be defined in a more general setting.

Csiszar f-divergence


Let $f: (0, +\infty) \to \mathbb{R}$ be a smooth scalar function satisfying all of the following conditions.

  • convex in its domain
  • $f(1) = 0$
  • $\ddot{f}(1) = 1$

An f-divergence for a parametric family $p(w|\tau)$ is defined as $$D_f(y, \tau_k) := \int f\Big(\frac{p(w|y)}{p(w|\tau_k)}\Big)\, p(w|\tau_k)\, dw.$$ Thanks to Jensen’s inequality, an f-divergence is always non-negative:
$$D_f(y, \tau_k) \geq f\Big(\int \frac{p(w|y)}{p(w|\tau_k)}\, p(w|\tau_k)\, dw\Big) = f\Big(\int p(w|y)\, dw\Big) = f(1) = 0.$$

The KL divergence $\mathrm{KL}\big(p(w|\tau_k)\,\|\,p(w|y)\big)$ is an f-divergence with $f(t) = -\log(t)$.
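As a quick sanity check (my own, not from the post), the following Monte Carlo sketch verifies for two hand-picked univariate Gaussians that the f-divergence with $f(t) = -\log(t)$ matches the closed-form KL divergence $\mathrm{KL}(p(w|\tau_k)\,\|\,p(w|y))$.

```python
# A Monte Carlo sanity check (for two univariate Gaussians of my choosing) that the
# f-divergence with f(t) = -log(t) recovers KL( p(w|tau_k) || p(w|y) ).
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(w, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (w - mean) ** 2 / var)

m_k, v_k = 0.0, 1.0      # p(w | tau_k)
m_y, v_y = 1.0, 2.0      # p(w | y)

# D_f(y, tau_k) = E_{p(w|tau_k)}[ f( p(w|y) / p(w|tau_k) ) ] with f(t) = -log(t)
w = rng.normal(m_k, np.sqrt(v_k), size=1_000_000)
ratio_log = log_normal_pdf(w, m_y, v_y) - log_normal_pdf(w, m_k, v_k)
mc_estimate = np.mean(-ratio_log)

# Closed-form KL( N(m_k, v_k) || N(m_y, v_y) )
kl = 0.5 * (np.log(v_y / v_k) + (v_k + (m_k - m_y) ** 2) / v_y - 1.0)

print(mc_estimate, kl)   # the two values agree up to Monte Carlo error
```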

Proximal-gradient descent


Given such a “distance” $D(y, \tau_k)$, we could perform an unconstrained proximal-gradient update as $$\tau_{k+1} \leftarrow \arg\min_{y \in \mathbb{R}^K}\Big\{ \langle g_{\tau_k}, y \rangle + \frac{1}{\alpha} D(y, \tau_k) \Big\}$$ where $g_{\tau_k}$ is a Euclidean gradient and the parameter space $\Omega_\tau$ is unconstrained.

Claim:

The second-order Taylor approximation of any f-divergence $D_f(y, \tau_k)$ at $y = \tau_k$ is $$\frac{1}{2}(y - \tau_k)^T F_\tau(\tau_k)\,(y - \tau_k)$$

Proof of the claim: (click to expand)

Proof:

We will show that the second-order Taylor approximation, denoted by $D(y,\tau_k)$, can be expressed as below. $$D(y,\tau_k) := \underbrace{D_f(\tau_k,\tau_k)}_{=0} + (y-\tau_k)^T \underbrace{\big[\nabla_y D_f(y,\tau_k)\big|_{y=\tau_k}\big]}_{=0} + \frac{1}{2}(y-\tau_k)^T \underbrace{\big[\nabla_y^2 D_f(y,\tau_k)\big|_{y=\tau_k}\big]}_{=F_\tau(\tau_k)} (y-\tau_k)$$

For the zero-order term, we have $$D_f(\tau_k,\tau_k) = \int f\Big(\frac{p(w|\tau_k)}{p(w|\tau_k)}\Big)\, p(w|\tau_k)\, dw = \int \underbrace{f(1)}_{=0}\, p(w|\tau_k)\, dw = 0.$$

For the first-order term, we have $$\nabla_y D_f(y,\tau_k) = \int \nabla_y f\Big(\frac{p(w|y)}{p(w|\tau_k)}\Big)\, p(w|\tau_k)\, dw = \int \dot{f}\Big(\frac{p(w|y)}{p(w|\tau_k)}\Big)\, \frac{\nabla_y p(w|y)}{p(w|\tau_k)}\, p(w|\tau_k)\, dw = \int \dot{f}\Big(\frac{p(w|y)}{p(w|\tau_k)}\Big)\, \nabla_y p(w|y)\, dw.$$

Note that when $y = \tau_k$, we can simplify the expression since $\dot{f}\big(\frac{p(w|y)}{p(w|\tau_k)}\big) = \dot{f}(1)$ is a constant. Therefore, $$\nabla_y D_f(y,\tau_k)\Big|_{y=\tau_k} = \dot{f}(1) \int \nabla_y p(w|y)\Big|_{y=\tau_k}\, dw = \dot{f}(1)\, \nabla_y \underbrace{\Big[\int p(w|y)\, dw\Big]}_{=1}\Big|_{y=\tau_k} = 0.$$

For the second-order term, we have $$\nabla_y^2 D_f(y,\tau_k) = \int \nabla_y^2 f\Big(\frac{p(w|y)}{p(w|\tau_k)}\Big)\, p(w|\tau_k)\, dw = \int \nabla_y\Big[\dot{f}\Big(\frac{p(w|y)}{p(w|\tau_k)}\Big)\, \frac{\nabla_y p(w|y)}{p(w|\tau_k)}\Big]\, p(w|\tau_k)\, dw = \int \Big[\ddot{f}\Big(\frac{p(w|y)}{p(w|\tau_k)}\Big)\, \frac{\nabla_y p(w|y)}{p(w|\tau_k)}\, \frac{\nabla_y^T p(w|y)}{p(w|\tau_k)} + \dot{f}\Big(\frac{p(w|y)}{p(w|\tau_k)}\Big)\, \frac{\nabla_y^2 p(w|y)}{p(w|\tau_k)}\Big]\, p(w|\tau_k)\, dw$$ Note that when $y = \tau_k$, we can simplify the expression since $\ddot{f}\big(\frac{p(w|y)}{p(w|\tau_k)}\big) = \ddot{f}(1) = 1$ and $\dot{f}\big(\frac{p(w|y)}{p(w|\tau_k)}\big) = \dot{f}(1)$ is a constant.

Therefore, $$\nabla_y^2 D_f(y,\tau_k)\Big|_{y=\tau_k} = \underbrace{\int \frac{\nabla_y p(w|y)}{p(w|\tau_k)}\, \frac{\nabla_y^T p(w|y)}{p(w|\tau_k)}\, p(w|\tau_k)\, dw\Big|_{y=\tau_k}}_{\text{Term I}} + \underbrace{\dot{f}(1) \int \frac{\nabla_y^2 p(w|y)}{p(w|\tau_k)}\, p(w|\tau_k)\, dw\Big|_{y=\tau_k}}_{\text{Term II}}$$

Term II is zero since $$\dot{f}(1) \int \frac{\nabla_y^2 p(w|y)}{p(w|\tau_k)}\, p(w|\tau_k)\, dw\Big|_{y=\tau_k} = \dot{f}(1) \int \nabla_y^2 p(w|y)\, dw\Big|_{y=\tau_k} = \dot{f}(1)\, \nabla_y^2 \underbrace{\Big[\int p(w|y)\, dw\Big]}_{=1}\Big|_{y=\tau_k} = 0$$

Term I is the FIM since $$\int \frac{\nabla_y p(w|y)}{p(w|\tau_k)}\, \frac{\nabla_y^T p(w|y)}{p(w|\tau_k)}\, p(w|\tau_k)\, dw\Big|_{y=\tau_k} = \mathbb{E}_{p(w|\tau_k)}\big[\nabla_y \log p(w|y)\, \nabla_y^T \log p(w|y)\big]\Big|_{y=\tau_k} = \mathbb{E}_{p(w|\tau_k)}\big[\nabla_\tau \log p(w|\tau_k)\, \nabla_\tau^T \log p(w|\tau_k)\big] = F_\tau(\tau_k)$$

By the claim, when $D(y,\tau_k)$ is the second-order Taylor approximation of an f-divergence $D_f(y,\tau_k)$ at $y=\tau_k$, the unconstrained proximal-gradient update can be re-expressed as $$\tau_{k+1} \leftarrow \arg\min_{y \in \mathbb{R}^K}\Big\{ \langle g_{\tau_k}, y \rangle + \frac{1}{2\alpha}(y-\tau_k)^T F_\tau(\tau_k)\,(y-\tau_k) \Big\}$$

It is easy to see that we can obtain the following natural-gradient update from the above expression. $$\tau_{k+1} \leftarrow \tau_k - \alpha \underbrace{F_\tau^{-1}(\tau_k)\, g_{\tau_k}}_{=\hat{g}_{\tau_k}}$$
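The claim can also be checked numerically. The sketch below (a toy check of my own) compares a finite-difference Hessian of the closed-form KL divergence between univariate Gaussians, taken at $y = \tau_k$, against the analytic FIM under $\tau = (\mu, v)$.

```python
# A finite-difference check of the claim for a univariate Gaussian with
# tau = (mu, v): the Hessian of D_f(y, tau_k) at y = tau_k equals the FIM.
# Here D_f is the f-divergence with f(t) = -log(t), i.e. KL( p(w|tau_k) || p(w|y) ),
# which has a closed form for Gaussians.  (The family and point are my choices.)
import numpy as np

def kl(tau_k, y):
    m1, v1 = tau_k           # p(w | tau_k) = N(m1, v1)
    m2, v2 = y               # p(w | y)     = N(m2, v2)
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def numerical_hessian(fun, x, eps=1e-4):
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.zeros(d), np.zeros(d)
            e_i[i], e_j[j] = eps, eps
            H[i, j] = (fun(x + e_i + e_j) - fun(x + e_i - e_j)
                       - fun(x - e_i + e_j) + fun(x - e_i - e_j)) / (4.0 * eps ** 2)
    return H

tau_k = np.array([0.5, 2.0])                                 # (mu, v)
H = numerical_hessian(lambda y: kl(tau_k, y), tau_k.copy())
F = np.diag([1.0 / tau_k[1], 1.0 / (2.0 * tau_k[1] ** 2)])   # FIM of N(mu, v)

print(np.round(H, 4))
print(np.round(F, 4))    # the two matrices match up to finite-difference error
```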

Note

The Taylor expansion

  • uses the distance defined in a (Riemannian) gradient space to approximate the distance in a parameter space.
  • implicitly assumes that the parameter space Ωτ is open due to the definition of the expansion.

In Part V, we will show that $D(y,\tau_k)$ can also be an exact KL divergence when $p(w|\tau)$ belongs to an exponential family. Under a particular parametrization, natural-gradient descent can also be viewed as (unconstrained) mirror descent.

Warning:

The connection between natural-gradient descent and proximal-gradient/mirror descent could break down in constrained cases unless $\alpha$ is small enough. We will cover more about this point in Part VI.

Natural-gradient Descent in Non-intrinsic Parameter Spaces


As mentioned in Part I, an intrinsic parametrization creates a nice parameter space (e.g., a local vector space structure) and guarantees a non-singular FIM. We now discuss issues when it comes to natural-gradient descent over non-intrinsic parametrizations including overparameterization.

  1. We may not have a local vector space structure in a non-intrinsic parameter space. Therefore, natural-gradient descent in this parameter space is pointless since the updated parameter will leave the parameter space. Indeed, the FIM could also be ill-defined in such cases. We will illustrate this by examples.

    Bernoulli Example: Invalid NGD (click to expand)

    Consider the Bernoulli family $\{\mathbb{I}(w=0)\pi_0 + \mathbb{I}(w=1)\pi_1 \,|\, \pi_0>0,\, \pi_1>0,\, \pi_0+\pi_1=1\}$ with parameter $\tau = (\pi_0, \pi_1)$.

    As we showed in Part I, the FIM is ill-defined due to this equality constraint.

    Moreover, the NGD update will violate the equality constraint.

    Von Mises–Fisher Example: Invalid NGD (click to expand)

    Consider the 2-dimensional von Mises–Fisher family $\{p(w|\tau) := C(\kappa)\exp(\kappa(w_1\mu_1 + w_2\mu_2)) \,|\, \kappa>0,\, \mu_1^2+\mu_2^2=1\}$ with parameter $\tau=(\kappa,\mu_1,\mu_2)$, where $C(\kappa)$ is the normalization constant, $\kappa$ is a positive scalar, $w=(w_1,w_2)$ is a random unit vector defined on a circle, and $\mu=(\mu_1,\mu_2)$ is also a unit vector.

    We can show that the FIM is ill-defined under this parametrization due to this equality constraint.

  2. The FIM is singular in a non-intrinsic parameter space. In theory, the Moore–Penrose inverse could be used to compute natural gradients so that natural-gradient descent remains linearly invariant in this case (see the toy sketch after this list). However, the Moore–Penrose inverse often has to use the singular value decomposition (SVD) and destroys structures of the FIM. In practice, the iteration cost of the Moore–Penrose inverse is very high, as illustrated in the following example.
    Example: High iteration cost (click to expand)

    Consider a $d$-dimensional Gaussian mixture family $\{\frac{1}{C}\sum_{k=1}^{C}\mathcal{N}(w|\mu_k,\Sigma_k) \,|\, \mu_k \in \mathbb{R}^d,\, \Sigma_k \succ 0\}$ with $\tau = \{\mu_k, \Sigma_k\}_{k=1}^{C}$.

    If we use an initialization such that all $C$ components have the same mean $\mu_0$ and the same covariance $\Sigma_0$, this family reduces to a Gaussian family. In this case, $\tau$ is a non-intrinsic parameterization of a Gaussian family. Note that the FIM is singular under parametrization $\tau$. The iteration cost of natural-gradient descent in this parameter space $\Omega_\tau$ will be $O(C^3 d^6)$ if the Moore–Penrose inverse is employed.

    Now, consider an equivalent Gaussian family $\{\mathcal{N}(w|\mu,\Sigma) \,|\, \mu \in \mathbb{R}^d,\, \Sigma \succ 0\}$ with $\lambda = (\mu, \mathrm{vech}(\Sigma))$, where $\lambda$ is an intrinsic parameterization of the Gaussian family and is initialized with mean $\mu_0$ and covariance $\Sigma_0$.

    As we will show in Part V, the iteration cost of natural-gradient descent in this parameter space $\Omega_\lambda$ will be $O(d^3)$ if we exploit structures of the exact non-singular FIM.

  3. It is tempting to approximate the singular FIM by an empirical FIM with a scalar damping term and to use the Woodbury matrix identity to reduce the iteration cost of computing natural gradients. However, sample-based empirical approximations could be problematic [4]. Moreover, damping introduces an additional tuning hyper-parameter and destroys the linear invariance property of natural-gradient descent. Such an approximation should be used with caution.
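To make point 2 concrete, here is a toy overparameterized family of my own choosing, $p(w|a,b) = \mathcal{N}(w|a+b, 1)$, where only the sum $a+b$ matters: its FIM is singular, and a Moore–Penrose pseudoinverse is needed to define a natural-gradient direction at all.

```python
# A toy example (my own, for illustration) of a non-intrinsic, overparameterized
# family: p(w | a, b) = N(w | a + b, 1), where only the sum a + b matters.
# The FIM under tau = (a, b) is singular, so plain NGD is not defined and one has
# to fall back on, e.g., the Moore-Penrose pseudoinverse.
import numpy as np

# score: d/d(a,b) log N(w | a+b, 1) = (w - a - b) * [1, 1]
# FIM = E[ score score^T ] = Var(w) * [[1, 1], [1, 1]] = [[1, 1], [1, 1]]
F = np.array([[1.0, 1.0],
              [1.0, 1.0]])
print(np.linalg.matrix_rank(F))       # 1: the FIM is singular

g = np.array([0.3, 0.3])              # a Euclidean gradient of some loss in (a, b)
g_hat = np.linalg.pinv(F) @ g         # Moore-Penrose natural-gradient direction
print(g_hat)                          # [0.15, 0.15]; F @ g_hat reproduces g
```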

Euclidean Gradient Descent is NOT (Linearly) Invariant


For simplicity, consider an unconstrained optimization problem. $$\min_{\tau \in \mathbb{R}^K} h_\tau(\tau)$$

Euclidean gradient descent (GD) under parametrization $\tau$ is $$\tau_{k+1} \leftarrow \tau_k - \alpha\, g_{\tau_k} \tag{5}$$ where $g_{\tau_k} := \nabla_\tau h_\tau(\tau_k)$ is a Euclidean gradient.

Consider a reparametrization $\lambda$ so that $\lambda = U\tau$, where $U$ is a constant (square) invertible matrix. $$\min_{\lambda \in \mathbb{R}^K} h_\lambda(\lambda) := h_\tau(U^{-1}\lambda)$$

Euclidean gradient descent (GD) under parametrization $\lambda$ is $$\lambda_{k+1} \leftarrow \lambda_k - \alpha\, g_{\lambda_k} \tag{6}$$ where $g_{\lambda_k} := \nabla_\lambda h_\lambda(\lambda_k)$ is a Euclidean gradient.

Note that Euclidean gradients follow the transformation rule as

$$g_\tau^T = g_\lambda^T J, \quad \text{where } J^k_i := \frac{\partial \lambda^k(\tau)}{\partial \tau^i}$$

We can verify that $J = U$ and $g_\tau = U^T g_\lambda$.

Notice that $\tau_0 = U^{-1}\lambda_0$ by construction. The update in (5) at iteration $k=1$ can then be re-expressed as $$\tau_1 \leftarrow \tau_0 - \alpha\, g_{\tau_0} = U^{-1}\lambda_0 - \alpha\, U^T g_{\lambda_0} \neq U^{-1}\lambda_1$$

It is easy to see that updates in (5) and (6) are NOT equivalent. Therefore, Euclidean gradient descent is not invariant.
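The sketch below (a toy quadratic objective and a matrix $U$ of my own choosing) illustrates this numerically: one GD step in the $\tau$-space and one in the $\lambda$-space do not land on the same point.

```python
# A numerical illustration that Euclidean gradient descent is not invariant under
# the linear reparametrization lambda = U tau.  The quadratic objective and the
# matrix U are my choices for illustration.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # h_tau(tau) = 0.5 * tau^T A tau

U = np.array([[2.0, 1.0],
              [0.0, 3.0]])
U_inv = np.linalg.inv(U)
alpha = 0.1
tau0 = np.array([1.0, -1.0])
lam0 = U @ tau0

# GD in tau-space (Eq. 5): g_tau = A tau.
tau1 = tau0 - alpha * (A @ tau0)

# GD in lambda-space (Eq. 6): h_lambda(lam) = h_tau(U^{-1} lam),
# so g_lambda = U^{-T} A U^{-1} lam.
lam1 = lam0 - alpha * (U_inv.T @ A @ U_inv @ lam0)

print(tau1, U_inv @ lam1)                    # the two points differ
print(np.allclose(tau1, U_inv @ lam1))       # False: GD is not invariant
```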

Newton’s Method is Linearly Invariant


For simplicity, consider an unconstrained convex optimization problem. $$\min_{\tau \in \mathbb{R}^K} h_\tau(\tau)$$ where $h_\tau(\tau)$ is strongly convex and twice continuously differentiable w.r.t. $\tau$.

Newton’s method under parametrization $\tau$ is $$\tau_{k+1} \leftarrow \tau_k - \alpha\, H_\tau^{-1}(\tau_k)\, g_{\tau_k}$$ where $g_{\tau_k} := \nabla_\tau h_\tau(\tau_k)$ is a Euclidean gradient and $H_\tau(\tau_k) := \nabla_\tau^2 h_\tau(\tau_k)$ is the Hessian.

Consider a reparametrization $\lambda$ so that $\lambda = U\tau$, where $U$ is a constant (square) invertible matrix. $$\min_{\lambda \in \mathbb{R}^K} h_\lambda(\lambda) := h_\tau(U^{-1}\lambda)$$ where $h_\lambda(\lambda)$ is also strongly convex w.r.t. $\lambda$ due to (7).

Newton’s method under parametrization $\lambda$ is $$\lambda_{k+1} \leftarrow \lambda_k - \alpha\, H_\lambda^{-1}(\lambda_k)\, g_{\lambda_k}$$ where $g_{\lambda_k} := \nabla_\lambda h_\lambda(\lambda_k)$ is a Euclidean gradient and $H_\lambda(\lambda_k) := \nabla_\lambda^2 h_\lambda(\lambda_k)$ is the Hessian.

As we discussed in the previous section, Euclidean gradients follow the transformation rule $g_\tau^T = g_\lambda^T J$, where $J = U$.

Surprisingly, for a linear transformation, the Hessian follows the transformation rule like the FIM as

$$H_\tau(\tau_k) = \nabla_\tau(g_{\tau_k}) = \nabla_\tau(J^T g_{\lambda_k}) = J^T \nabla_\tau(g_{\lambda_k}) + \underbrace{[\nabla_\tau J^T]}_{=0}\, g_{\lambda_k} \;\; (\text{in linear cases, } J = U \text{ is a constant}) = J^T \big[\nabla_\lambda(g_{\lambda_k})\big] J = J^T H_\lambda(\lambda_k)\, J \tag{7}$$

Therefore, the direction in Newton’s method, denoted by $\tilde{g}_{\tau_k} := H_\tau^{-1}(\tau_k)\, g_{\tau_k}$, is transformed like natural gradients in linear cases:

$$\tilde{g}_{\tau_k} := H_\tau^{-1}(\tau_k)\, g_{\tau_k} = \big[J^T H_\lambda(\lambda_k) J\big]^{-1} g_{\tau_k} = J^{-1} H_\lambda^{-1}(\lambda_k) J^{-T} \big[J^T g_{\lambda_k}\big] = J^{-1} H_\lambda^{-1}(\lambda_k)\, g_{\lambda_k} = J^{-1} \tilde{g}_{\lambda_k} = Q\, \tilde{g}_{\lambda_k}$$ where by definition we have $Q = J^{-1}$.

The consequence is that Newton’s method, like natural-gradient descent, is linearly invariant.
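Reusing the same toy quadratic setup as in the GD example above (again, my own choice for illustration), the sketch below confirms that one Newton step in each parametrization lands on the same point.

```python
# The same setup as the GD example above, but with Newton's method: one Newton step
# in tau-space and one in lambda-space land on the same point, illustrating linear
# invariance.  (The quadratic objective and U are my choices for illustration.)
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # Hessian of h_tau(tau) = 0.5 * tau^T A tau
U = np.array([[2.0, 1.0],
              [0.0, 3.0]])
U_inv = np.linalg.inv(U)
alpha = 0.5
tau0 = np.array([1.0, -1.0])
lam0 = U @ tau0

# Newton in tau-space: H_tau = A, g_tau = A tau.
tau1 = tau0 - alpha * np.linalg.solve(A, A @ tau0)

# Newton in lambda-space: H_lambda = U^{-T} A U^{-1}, g_lambda = U^{-T} A U^{-1} lam.
H_lam = U_inv.T @ A @ U_inv
g_lam = H_lam @ lam0
lam1 = lam0 - alpha * np.linalg.solve(H_lam, g_lam)

print(np.allclose(tau1, U_inv @ lam1))    # True: Newton's method is linearly invariant
```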

The Hessian is not a valid manifold metric

The Hessian $H_\tau(\tau_k) = \nabla_\tau^2 h_\tau(\tau_k)$ in general is not a valid manifold metric since it does not follow the transformation rule of a metric in non-linear cases3.

In contrast, the FIM is a valid manifold metric. Recall that the FIM can also be computed as $F_\tau(\tau) = -\mathbb{E}_{p(w|\tau)}\big[\nabla_\tau^2 \log p(w|\tau)\big]$.

Claim:

The FIM follows the transformation rule even in non-linear cases.

Proof of the claim (click to expand)

Proof

Given a non-linear intrinsic reparametrization λ, recall that the Jacobian matrix J(τk) in (7) is no longer a constant matrix but a square and non-singular matrix. In this case, the FIM still follows the transformation rule thanks to the expectation of the score function.

$$\begin{aligned} F_\tau(\tau_k) &= -\mathbb{E}_{p(w|\tau_k)}\big[\nabla_\tau^2 \log p(w|\tau_k)\big] = -\mathbb{E}_{p(w|\lambda_k)}\big[\nabla_\tau^2 \log p(w|\tau_k)\big] \\ &= -\mathbb{E}_{p(w|\lambda_k)}\Big[\nabla_\tau \big[J^T(\tau_k)\, \nabla_\lambda \log p(w|\lambda_k)\big]\Big] \\ &= -J^T(\tau_k)\,\mathbb{E}_{p(w|\lambda_k)}\big[\nabla_\tau \nabla_\lambda \log p(w|\lambda_k)\big] - \big[\nabla_\tau J^T(\tau_k)\big]\, \underbrace{\mathbb{E}_{p(w|\lambda_k)}\big[\nabla_\lambda \log p(w|\lambda_k)\big]}_{=0 \text{ (the expectation of the score is zero)}} \\ &= -J^T(\tau_k)\,\mathbb{E}_{p(w|\lambda_k)}\Big[\big[\nabla_\lambda^2 \log p(w|\lambda_k)\big]\, J(\tau_k)\Big] \\ &= J^T(\tau_k)\, \Big(-\mathbb{E}_{p(w|\lambda_k)}\big[\nabla_\lambda^2 \log p(w|\lambda_k)\big]\Big)\, J(\tau_k) \\ &= J^T(\tau_k)\, F_\lambda(\lambda_k)\, J(\tau_k) \end{aligned}$$
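As a numerical illustration (a toy check of my own, using the fixed-mean Gaussian family from earlier), the sketch below estimates the FIM by Monte Carlo under the non-linear reparametrization $\lambda = \log v$ and confirms $F_\tau = J^T F_\lambda J$ with $J = \partial\lambda/\partial\tau = 1/v$.

```python
# A Monte Carlo sketch of the transformation rule F_tau = J^T F_lambda J for a
# non-linear reparametrization, using the fixed-mean Gaussian N(w | 0, v) (my choice
# of family): tau = v with F_tau(v) = 1/(2 v^2), and lambda = log v with
# F_lambda = 1/2 and Jacobian J = d lambda / d tau = 1 / v.
import numpy as np

rng = np.random.default_rng(0)
v = 3.0
w = rng.normal(0.0, np.sqrt(v), size=2_000_000)

# Scores under the two parametrizations.
score_tau = -0.5 / v + 0.5 * w ** 2 / v ** 2        # d/dv       log N(w | 0, v)
score_lam = -0.5 + 0.5 * w ** 2 / v                 # d/dlog(v)  log N(w | 0, v)

F_tau_mc = np.mean(score_tau ** 2)                  # Monte Carlo FIM under tau
F_lam_mc = np.mean(score_lam ** 2)                  # Monte Carlo FIM under lambda
J = 1.0 / v                                         # d lambda / d tau

print(F_tau_mc, 1.0 / (2.0 * v ** 2))               # ~ 1/(2 v^2)
print(J * F_lam_mc * J, 1.0 / (2.0 * v ** 2))       # J^T F_lambda J gives the same metric
```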

As we will discuss in Part V, for a special parametrization $\tau$ (known as a natural parametrization) of an exponential family, the FIM under this parametrization can be computed as $F_\tau(\tau) = \nabla_\tau^2 A_\tau(\tau)$, where $A_\tau(\tau)$ is a strictly convex function.

In the literature, the exponential family with a natural parametrization is known as a Hessian manifold [5], where the FIM under this kind of parametrization is called a Hessian metric. However, a non-linear reparametrization will lead to a non-natural parametrization.


References

[1] S.-I. Amari, "Natural gradient works efficiently in learning," Neural computation 10:251–276 (1998).

[2] C. Atkinson & A. F. S. Mitchell, "Rao’s distance measure," Sankhyā: The Indian Journal of Statistics, Series A 345–365 (1981).

[3] M. E. Khan, R. Babanezhad, W. Lin, M. Schmidt, & M. Sugiyama, "Faster stochastic variational inference using Proximal-Gradient methods with general divergence functions," Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence (2016), pp. 319–328.

[4] F. Kunstner, P. Hennig, & L. Balles, "Limitations of the empirical Fisher approximation for natural gradient descent," Advances in Neural Information Processing Systems 32:4156–4167 (2019).

[5] H. Shima, The geometry of Hessian structures (World Scientific, 2007).

Footnotes:

  1. Informally, we locally identify the parameter space with the tangent vector space. However, not all properties are the same in these two spaces. For example, the weighted inner product induced by the Riemannian metric is only defined in the vector space. We often do not use the same weighted inner product in the parameter space. 

  2. In Riemannian geometry, a geodesic is induced by the Levi-Civita connection. This connection is known as the metric-compatible parallel transport. Christoffel symbols are used to represent the connection in a coordinate/parametrization system. 

  3. This is due to the non-zero partial derivatives of the Jacobian matrix (Eq. (7)) when it comes to non-linear parameter transformations. These partial derivatives are also related to Christoffel symbols or Levi-Civita connection coefficients.