Part VI: Handling Parameter Constraints of the Exponential Family in Natural-gradient Methods
Warning: Work in Progress (Part VI is incomplete)
Goal (edited: 01-Jan-23)
In this blog post, we discuss how to handle parameter constraints of the exponential family.
Handling Parameter Constraints
Recall that in Part IV, we discussed the many faces of NGD in unconstrained cases. These methods can also be extended to constrained cases to handle the parameter constraint.
Given a natural parameterization $\eta$ with parameter constraint $\Omega_\eta$, consider the following problem:
$$
\begin{aligned}
\min_{\eta \in \Omega_\eta} \ell_\eta(\mathbf{\eta}) = E_{p(w|\eta)} [h(w) + \log p(w|\eta)]
\end{aligned}
$$
We could also consider the reparameterized problem by using the expectation parameter $\mathbf{m}$.
$$
\begin{aligned}
\min_{m \in \Omega_m} \ell_m(\mathbf{m}) = E_{p(w| \eta(m) )} [h(w) + \log p(w|\eta(\mathbf{m}))] = \ell_\eta(\eta(\mathbf{m}))
\end{aligned}
$$
Note:

Recall that updates in $\eqref{1}$ and $\eqref{3}$ are equivalent in exponential family cases even when $\Omega_m$ is constrained, as long as $\Omega_\eta$ is unconstrained.
When $\Omega_\eta$ is constrained, updates in $\eqref{1}$, $\eqref{2}$, $\eqref{3}$, and $\eqref{4}$ are distinct methods.
Since both $\Omega_m$ and $\Omega_\eta$ can be arbitrary open subsets in $\mathcal{R}^K$ in general, $\eqref{1}$, $\eqref{2}$, $\eqref{3}$, and $\eqref{4}$ can be difficult to solve.
Unconstrained Reparametrization
A straightforward approach to handle a constraint is via an unconstrained reparametrization, where we use the transformation rule for natural gradients as discussed in Part III.
However, this requires computing the Jacobian matrix used in the transformation rule. It is possible to use AutoDiff to compute (implicit) natural gradients, as suggested by [1]. Unfortunately, an unconstrained reparametrization can easily destroy structure in a parameter space due to the (Jacobian) matrix product. Moreover, it is often difficult for AutoDiff to exploit sparsity, such as automatically using a sparse linear solver. See Appendix G.1 of [2] for a detailed discussion of this issue.
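To make the transformation rule concrete, here is a minimal JAX sketch for a univariate Gaussian, where the constraint on the second natural parameter ($\eta_2 < 0$) is removed by the illustrative map $\eta = T(\lambda) = (\lambda_1, -e^{\lambda_2})$. The function names and the map are assumptions made for this sketch, not part of the post.

```python
import jax
import jax.numpy as jnp

# Minimal sketch: univariate Gaussian with natural parameters
# eta = (mu / sigma^2, -1 / (2 sigma^2)), so Omega_eta = R x (-inf, 0).
# The (hypothetical) reparametrization eta = T(lam) maps R^2 onto Omega_eta.

def log_partition(eta):
    # Log-partition A(eta) of the univariate Gaussian (up to an additive constant).
    return -eta[0] ** 2 / (4.0 * eta[1]) - 0.5 * jnp.log(-2.0 * eta[1])

def T(lam):
    # Unconstrained parametrization: lam in R^2 -> eta in Omega_eta.
    return jnp.array([lam[0], -jnp.exp(lam[1])])

def natural_gradient_lambda(loss_eta, lam):
    # Transformation rule: F_lam = J^T F_eta J with J = d eta / d lam, then the
    # natural gradient in the unconstrained space is F_lam^{-1} grad_lam loss.
    eta = T(lam)
    F_eta = jax.hessian(log_partition)(eta)   # FIM = Hessian of the log partition
    J = jax.jacobian(T)(lam)
    F_lam = J.T @ F_eta @ J                   # dense Jacobian product
    g_lam = jax.grad(lambda l: loss_eta(T(l)))(lam)
    return jnp.linalg.solve(F_lam, g_lam)
```

The dense product `J.T @ F_eta @ J` is exactly where structure and sparsity of the original parameter space can be lost.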
Projected Natural Gradient Descent
Another straightforward approach derived from natural-gradient descent is projected natural-gradient descent, where we use
the FIM $\mathbf{F}_\eta(\eta_k)$ evaluated at $\eta_k$
as a projection metric and use the weighted inner product it induces to measure the distance for the projection.
$$
\begin{aligned}
& \eta_{k+1} \leftarrow \arg\min_{ {\color{red} \mathbf{y}} \in \Omega_\eta} \ \| {\color{red} \mathbf{y}} - \eta_k + \alpha
\mathbf{F}_\eta^{-1} (\eta_k) \nabla_\eta \ell_\eta(\eta_k) \|^2_{ \color{green}{ \mathbf{F}_\eta(\eta_k)} }\\
=& \arg\min_{ \mathbf{y} \in \Omega_\eta} \big[ (\mathbf{y}-\eta_k) + \alpha\mathbf{F}_\eta^{-1} (\eta_k) \nabla_\eta \ell_\eta(\eta_k)
\big]^T \mathbf{F}_{\eta}(\eta_k) \big[ (\mathbf{y}-\eta_k) + \alpha\mathbf{F}_\eta^{-1} (\eta_k) \nabla_\eta \ell_\eta(\eta_k)
\big]\\
=& \arg\min_{ \mathbf{y} \in \Omega_\eta} 2\alpha \big[ \frac{1}{2\alpha} (\mathbf{y}-\eta_k)^T \mathbf{F}_{\eta}(\eta_k) (\mathbf{y}-\eta_k) + (\mathbf{y}-\eta_k)^T \nabla_\eta \ell_\eta(\eta_k) + \underbrace{ \frac{\alpha}{2} \nabla_\eta^T \ell_\eta(\eta_k) \mathbf{F}^{-1}_\eta(\eta_k) \nabla_\eta \ell_\eta(\eta_k)}_{\text{constant w.r.t. } \mathbf{y}} \big] \\
=& \arg\min_{ {\color{red}\mathbf{y}} \in \Omega_\eta } \{ \langle \nabla_\eta \ell_\eta(\eta_k),{\color{red}\mathbf{y}}-\eta_k \rangle + \frac{1}{2\alpha} (\mathbf{y}-\eta_k)^T \mathbf{F}_{\eta}(\eta_k) (\mathbf{y}-\eta_k) \}
\end{aligned}\tag{1}\label{1}
$$
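As a rough illustration, the subproblem in $\eqref{1}$ has a convex quadratic objective over $\Omega_\eta$. The sketch below simply hands it to a generic constrained solver; the names `constraint` and `alpha` are placeholders, and a solver specialized to the particular $\Omega_\eta$ would normally be preferable.

```python
import numpy as np
from scipy.optimize import minimize

def projected_ngd_step(eta_k, F, grad, constraint, alpha=0.1):
    # One step of update (1): minimize the linearized loss plus the FIM-weighted
    # quadratic over Omega_eta, encoded here as a generic inequality constraint
    # (constraint(y) >= 0 inside Omega_eta).
    def objective(y):
        d = y - eta_k
        return grad @ d + (0.5 / alpha) * (d @ F @ d)
    res = minimize(objective, x0=np.asarray(eta_k, dtype=float),
                   constraints=[{"type": "ineq", "fun": constraint}])
    return res.x
```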
This approach is closely related to proximal-gradient descent.
Recall that in
Part IV,
we showed that natural-gradient descent can be viewed as a proximal-gradient method, where we use the
second-order Taylor approximation of any f-divergence $\mathrm{D}_f(\mathbf{y},\eta_k)$ at $\mathbf{y}=\eta_k$:
$$
\begin{aligned}
\mathrm{D}_f(\mathbf{y},\eta_k) \approx \frac{1}{2} (\mathbf{y}-\eta_k)^T \mathbf{F}_{\eta}(\eta_k) (\mathbf{y}-\eta_k)
\end{aligned}
$$
Proximal Gradient Descent
We can also obtain a proximal-gradient descent method by using an f-divergence $\mathrm{D}_f(\mathbf{y},\eta_k)$
without the Taylor approximation.
$$
\begin{aligned}
\eta_{k+1} \leftarrow \arg\min_{ {\color{red}\mathbf{y}} \in \Omega_\eta } \{ \langle \nabla_\eta \ell_\eta(\eta_k),{\color{red}\mathbf{y}}-\eta_k \rangle + \frac{1}{\alpha} \mathrm{D}_f( {\color{red} \mathbf{y}},\eta_k) \}
\end{aligned}\tag{2}\label{2}
$$
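For instance, when $\mathrm{D}_f$ is taken to be the KL divergence discussed below, the divergence term can be written directly in terms of the log-partition function $A_\eta$ of the exponential family, since $\mathrm{KL} [p(\mathbf{w}|\eta_k) \,\|\, p(\mathbf{w}|\mathbf{y})] = A_\eta(\mathbf{y}) - A_\eta(\eta_k) - \langle \nabla A_\eta(\eta_k), \mathbf{y}-\eta_k \rangle$. A rough sketch of one step of $\eqref{2}$, again with a generic constrained solver and placeholder names:

```python
import numpy as np
from scipy.optimize import minimize

def kl_proximal_step(eta_k, grad, A, grad_A_k, constraint, alpha=0.1):
    # One step of update (2) with D_f(y, eta_k) = KL[p(w|eta_k) || p(w|y)], written
    # via the log partition A as A(y) - A(eta_k) - <grad A(eta_k), y - eta_k>.
    def kl(y):
        return A(y) - A(eta_k) - grad_A_k @ (y - eta_k)
    def objective(y):
        return grad @ (y - eta_k) + kl(y) / alpha
    res = minimize(objective, x0=np.asarray(eta_k, dtype=float),
                   constraints=[{"type": "ineq", "fun": constraint}])
    return res.x
```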
We have the following additional results when the f-divergence is chosen to be the KL divergence.

The KL divergence
$\mathrm{KL} [p(\mathbf{w}|\eta_k) \,\|\, p(\mathbf{w}|{\color{red}\mathbf{y}})]=\mathrm{D}_f({\color{red}\mathbf{y}},\eta_k)=\mathrm{B}_{A_\eta}(\eta_k,{\color{red}\mathbf{y}})$
is both an f-divergence and a Bregman divergence. Its second-order Taylor approximation at $\mathbf{y}=\eta_k$
is $$ \begin{aligned} \mathrm{KL} [p(\mathbf{w}|\eta_k) \,\|\, p(\mathbf{w}|{\color{red}\mathbf{y}})] \approx \frac{1}{2} (\mathbf{y}-\eta_k)^T \mathbf{F}_{\eta}(\eta_k) (\mathbf{y}-\eta_k) \end{aligned} $$

The second-order Taylor approximation of the reverse KL divergence
$\mathrm{KL} [p(\mathbf{w}|{ \color{red} \mathbf{y}} ) \,\|\, p(\mathbf{w}|\eta_k)]$
at $\mathbf{y}=\eta_k$
gives the same approximation: $$ \begin{aligned} \mathrm{KL} [p(\mathbf{w}|{\color{red}\mathbf{y}}) \,\|\, p(\mathbf{w}|\eta_k)] \approx \frac{1}{2} (\mathbf{y}-\eta_k)^T \mathbf{F}_{\eta}(\eta_k) (\mathbf{y}-\eta_k) \end{aligned} $$
Note:
The KL divergence is the only divergence [3] that is both an f-divergence and a Bregman divergence.
We have the following identity for the KL divergence.
$$
\begin{aligned}
\mathrm{KL} [p(\mathbf{w}|\eta_k) \,\|\, p(\mathbf{w}|{\color{red}\mathbf{y}})] =
\mathrm{D}_f({\color{red}\mathbf{y}},\eta_k) = \mathrm{B}_{A_\eta}(\eta_k,{\color{red}\mathbf{y}}) = \mathrm{B}_{A^*_\eta}({\color{red}\mathbf{x}},\mathbf{m}_k)
= \mathrm{KL} [p(\mathbf{w}|\mathbf{m}_k) \,\|\, p(\mathbf{w}|{\color{red}\mathbf{x}})]
\end{aligned}
$$
where $\mathbf{y}$ is a natural parameter, $\mathbf{x}$ is the corresponding expectation parameter, and $\mathbf{m}_k$ is the expectation parameter corresponding to $\eta_k$.
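As a quick sanity check, both the log-partition (Bregman) form of the KL divergence and its second-order Taylor approximation can be verified numerically for the univariate Gaussian. The sketch below uses the same Gaussian log partition as the earlier sketch; the specific numbers are arbitrary.

```python
import jax
import jax.numpy as jnp

def log_partition(eta):
    # Univariate Gaussian log partition: A(eta) = -eta1^2/(4 eta2) - 0.5 log(-2 eta2).
    return -eta[0] ** 2 / (4.0 * eta[1]) - 0.5 * jnp.log(-2.0 * eta[1])

def kl_via_log_partition(eta_k, y):
    # KL[p(w|eta_k) || p(w|y)] = A(y) - A(eta_k) - <grad A(eta_k), y - eta_k>.
    return (log_partition(y) - log_partition(eta_k)
            - jax.grad(log_partition)(eta_k) @ (y - eta_k))

def fim_quadratic(eta_k, y):
    # Second-order Taylor approximation: 0.5 (y - eta_k)^T F_eta(eta_k) (y - eta_k).
    d = y - eta_k
    return 0.5 * d @ jax.hessian(log_partition)(eta_k) @ d

eta_k = jnp.array([0.5, -1.0])            # a valid natural parameter (eta_2 < 0)
y = eta_k + jnp.array([0.05, -0.02])      # a nearby point in Omega_eta
print(kl_via_log_partition(eta_k, y))     # close to fim_quadratic(eta_k, y)
print(fim_quadratic(eta_k, y))
```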
Mirror Descent
Mirror descent in the expectation space remains the same as in Part V.
$$
\begin{aligned}
\mathbf{m}_{k+1} \leftarrow \arg \min_{ {\color{red} \mathbf{x}} \in \Omega_m}\{ \langle \nabla_m \ell_m(\mathbf{m}_k), {\color{red}\mathbf{x}}-\mathbf{m}_k \rangle + \frac{1}{\alpha} \mathrm{B}_{A^*_\eta}({\color{red}\mathbf{x}},\mathbf{m}_k) \}
\end{aligned}\tag{3}\label{3}
$$
where $\nabla_m \ell_m(\mathbf{m}_k) = \nabla_m \ell_\eta( \underbrace{ \eta(\mathbf{m}_k)}_{=\eta_k} )= \mathbf{F}_\eta^{-1} (\eta_k) \nabla_\eta \ell_\eta(\eta_k)$.
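Since the FIM of an exponential family (in its natural parametrization) is the Hessian of the log-partition function, this identity amounts to one linear solve. A minimal sketch, with `loss_eta` and `log_partition` supplied by the user (as in the earlier Gaussian sketch):

```python
import jax
import jax.numpy as jnp

def expectation_space_gradient(loss_eta, log_partition, eta_k):
    # grad_m l_m(m_k) = F_eta(eta_k)^{-1} grad_eta l_eta(eta_k), using the fact
    # that the FIM equals the Hessian of the log-partition function.
    F = jax.hessian(log_partition)(eta_k)
    g = jax.grad(loss_eta)(eta_k)
    return jnp.linalg.solve(F, g)
```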
We could also perform mirror descent in the natural parameter space as
$$
\begin{aligned}
\mathbf{\eta}_{k+1} \leftarrow \arg \min_{ {\color{red}\mathbf{y}} \in \Omega_\eta}\{ \langle \nabla_\eta \ell_\eta(\mathbf{\eta}_k), { \color{red}\mathbf{y} }-\mathbf{\eta}_k \rangle + \frac{1}{\alpha} \mathrm{B}_{A_\eta}( { \color{red}\mathbf{y} },\mathbf{\eta}_k) \}
\end{aligned}\tag{4}\label{4}
$$
Note:
Without the Taylor approximation, $\eqref{2}$
and $\eqref{4}$
are distinct updates since the KL divergence is not symmetric.
Adaptive Stepsize Selection
Since $\Omega_\eta$ is an open set in $\mathcal{R}^K$, standard natural-gradient descent is still valid when the stepsize is small enough.
One idea is therefore to use an adaptive stepsize for natural-gradient descent without a projection.
$$
\begin{aligned}
\eta_{k+1} \leftarrow \eta_k - \alpha_k \mathbf{F}_\eta^{-1} (\eta_k) \nabla_\eta \ell_\eta(\eta_k)
\end{aligned}\tag{5}\label{5}
$$
where the stepsize $\alpha_k$
is selected so that
$\eta_{k+1} \in \Omega_\eta$
.
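A minimal sketch of such a stepsize selection by backtracking, where `in_omega` is a hypothetical membership test for $\Omega_\eta$ and the halving factor is an arbitrary choice:

```python
def ngd_step_with_feasible_stepsize(eta_k, nat_grad, in_omega,
                                    alpha0=1.0, shrink=0.5, max_tries=50):
    # Update (5): shrink the stepsize until the new iterate stays inside Omega_eta.
    alpha = alpha0
    for _ in range(max_tries):
        eta_next = eta_k - alpha * nat_grad
        if in_omega(eta_next):
            return eta_next, alpha
        alpha *= shrink
    raise RuntimeError("no feasible stepsize found")
```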
However, for a general parameter constraint $\Omega_\eta$, this approach could result in slow progress:
the stepsize-selection procedure has to check the constraint at each iteration and may end up selecting an extremely small stepsize $\alpha_k$.
Riemannian Gradient Descent
An alternative approach is to use Riemannian gradient descent, as discussed in Part IV, which is a generalization of natural-gradient descent. Note that this approach is completely different from mirror descent.
To avoid solving the geodesic ODE, we can use an approximation of the geodesic, which
induces a retraction map.
$$
\begin{aligned}
\eta_{k+1} \leftarrow \mathrm{Ret}_{\eta_k} ( -\alpha \mathbf{F}_\eta^{-1} (\eta_k) \nabla_\eta \ell_\eta(\eta_k) )
\end{aligned}\tag{6}\label{6}
$$
As mentioned in
Part IV,
we have to carefully select a retraction map to handle the parameter constraint.
Given a general parameter constraint $\Omega_\eta$, it can be difficult to come up with an efficient retraction map that satisfies the constraint.
For positive-definite constraints in $\Omega_\eta$, see [4] for an example of how to derive efficient Riemannian gradient updates.
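For example, for a positive-definite block of the natural parameter (such as a Gaussian precision matrix), one commonly used second-order retraction is $\mathrm{Ret}_{S}(\mathbf{V}) = S + \mathbf{V} + \frac{1}{2}\mathbf{V} S^{-1} \mathbf{V}$, which stays in the cone for any symmetric $\mathbf{V}$. The sketch below is only an illustration and is not necessarily the exact retraction derived in [4].

```python
import numpy as np

def spd_retraction(S, V):
    # Ret_S(V) = S + V + 0.5 * V S^{-1} V
    #          = 0.5 * (S + (S + V) S^{-1} (S + V)),
    # which is positive definite whenever S is positive definite and V is symmetric.
    return S + V + 0.5 * V @ np.linalg.solve(S, V)
```

A Riemannian/natural-gradient step of the form $\eqref{6}$ would then be `spd_retraction(S, -alpha * nat_grad_S)`.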
References
[1] H. Salimbeni, S. Eleftheriadis, & J. Hensman, "Natural gradients in practice: Non-conjugate variational inference in Gaussian process models," International Conference on Artificial Intelligence and Statistics (PMLR, 2018), pp. 689–697.
[2] W. Lin, F. Nielsen, M. E. Khan, & M. Schmidt, "Tractable structured natural-gradient descent using local parameterizations," International Conference on Machine Learning (ICML, 2021).
[3] S.-I. Amari, Information geometry and its applications (Springer, 2016).
[4] W. Lin, M. Schmidt, & M. E. Khan, "Handling the positive-definite constraint in the Bayesian learning rule," International Conference on Machine Learning (PMLR, 2020), pp. 6116–6126.