Part VI: Handling Parameter Constraints of Exponential Family In Natural-gradient Methods
Warning: work in progress (Part VI is incomplete)
Goal (edited: 01-Jan-23)
In this blog post, we discuss how to handle parameter constraints of the exponential family.
Handling Parameter Constraints
Recall that in Part IV, we discussed the many faces of NGD in unconstrained cases. These methods can also be extended to constrained cases to handle parameter constraints.
Given a natural parameterization $\eta$ with parameter constraint $\Omega_\eta$, consider the following problem
$$
\begin{aligned}
\min_{\eta \in \Omega_\eta} \ell_\eta(\mathbf{\eta}) = E_{p(w|\eta)} [h(w) + \log p(w|\eta)]
\end{aligned}
$$
We could also consider the reparameterized problem by using the expectation parameter $\mathbf{m}$.
$$
\begin{aligned}
\min_{m \in \Omega_m} \ell_m(\mathbf{m}) = E_{p(w| \eta(m) )} [h(w) + \log p(w|\eta(\mathbf{m}))] = \ell_\eta(\eta(\mathbf{m}))
\end{aligned}
$$
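To make the setup concrete, here is a minimal Monte-Carlo sketch of $\ell_\eta$, assuming a 1-D Gaussian with natural parameters $\eta = (\mu/\sigma^2, -1/(2\sigma^2))$ and a toy quadratic $h$; for this family the constraint $\Omega_\eta$ is simply $\eta_2 < 0$. All function names are local to this sketch, and the same running example is reused in the sketches below.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(w):
    # Toy target term (an assumption of this sketch): a quadratic "negative log-likelihood".
    return 0.5 * (w - 3.0) ** 2

def eta_to_moments(eta):
    # Natural parameters (eta1, eta2) of a 1-D Gaussian -> (mean, variance).
    # The constraint Omega_eta is eta2 < 0.
    var = -0.5 / eta[1]
    return eta[0] * var, var

def loss_eta(eta, n_samples=100_000):
    # Monte Carlo estimate of E_{p(w|eta)}[ h(w) + log p(w|eta) ].
    mean, var = eta_to_moments(eta)
    w = rng.normal(mean, np.sqrt(var), size=n_samples)
    log_p = -0.5 * np.log(2 * np.pi * var) - 0.5 * (w - mean) ** 2 / var
    return np.mean(h(w) + log_p)

print(loss_eta(np.array([1.0, -0.5])))   # a feasible point: eta2 < 0
```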
Note:
- Recall that updates in $\eqref{1}$ and $\eqref{3}$ are equivalent in exponential family cases even when $\Omega_m$ is constrained, as long as $\Omega_\eta$ is unconstrained.
- When $\Omega_\eta$ is constrained, updates in $\eqref{1}$, $\eqref{2}$, $\eqref{3}$, and $\eqref{4}$ are distinct methods.
- Since both $\Omega_m$ and $\Omega_\eta$ can be arbitrary open subsets in $\mathcal{R}^K$ in general, $\eqref{1}$, $\eqref{2}$, $\eqref{3}$, and $\eqref{4}$ can be difficult to solve.
Unconstrained Reparametrization
A straightforward approach to handle a constraint is via an unconstrained reparametrization, where we use the transformation rule for natural-gradients as discussed in Part III.
However, we have to compute the Jacobian matrix used in the transformation rule. It is possible to use Auto-Diff to compute (implicit) natural-gradients as suggested by [1]. However, an unconstrained reparametrization can easily destroy structure in a parameter space due to the (Jacobian) matrix product. Moreover, it is often difficult for Auto-Diff to exploit sparsity, such as automatically using a sparse linear solver. Please see Appendix G.1 of [2] for a detailed discussion of this issue.
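For the 1-D Gaussian example, one unconstrained reparameterization is $\lambda = (\mu, \log \sigma)$. Below is a minimal numerical sketch of the transformation rule; the Jacobian is approximated by finite differences here, whereas Auto-Diff would compute it exactly. The parameterization and helper names are illustrative assumptions, not code from [1] or [2].

```python
import numpy as np

def eta_of_lam(lam):
    # Unconstrained parameters lam = (mu, log_sigma) -> natural parameters (1-D Gaussian).
    mu, log_sigma = lam
    var = np.exp(2.0 * log_sigma)
    return np.array([mu / var, -0.5 / var])

def jacobian(lam, eps=1e-6):
    # Finite-difference Jacobian d eta / d lam; Auto-Diff would compute this exactly.
    J = np.zeros((2, 2))
    for j in range(2):
        d = np.zeros(2)
        d[j] = eps
        J[:, j] = (eta_of_lam(lam + d) - eta_of_lam(lam - d)) / (2.0 * eps)
    return J

def natgrad_lam(lam, F_eta, grad_eta):
    # Transformation rule: F_lam = J^T F_eta J and grad_lam = J^T grad_eta,
    # so the natural gradient in the unconstrained space is F_lam^{-1} grad_lam.
    J = jacobian(lam)
    F_lam = J.T @ F_eta @ J
    grad_lam = J.T @ grad_eta
    return np.linalg.solve(F_lam, grad_lam)
```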
Projected Natural Gradient Descent
Another straightforward approach derived from natural-gradient descent is projected natural-gradient descent, where we use the FIM $\mathbf{F}_\eta(\eta_k)$ evaluated at $\eta_k$ as a projection metric and use the weighted inner product to measure the distance for the projection.
$$
\begin{aligned}
& \eta_{k+1} \leftarrow \arg\min_{ {\color{red} y} \in \Omega_\eta} \| {\color{red} \mathbf{y}} - \eta_k + \alpha
\mathbf{F}_\eta^{-1} (\eta_k) \nabla_\eta \ell_\eta(\eta_k) \|^2_{ \color{green}{ \mathbf{F}_\eta(\eta_k)} }\\
=& \arg\min_{ y \in \Omega_\eta} \big[ (\mathbf{y}-\eta_k) + \alpha\mathbf{F}_\eta^{-1} (\eta_k) \nabla_\eta \ell_\eta(\eta_k)
\big]^T \mathbf{F}_{\eta}(\eta_k) \big[ (\mathbf{y}-\eta_k) + \alpha\mathbf{F}_\eta^{-1} (\eta_k) \nabla_\eta \ell_\eta(\eta_k)
\big]\\
=& \arg\min_{ y \in \Omega_\eta} 2\alpha \big[ \frac{1}{2\alpha} (\mathbf{y}-\eta_k)^T \mathbf{F}_{\eta}(\eta_k) (\mathbf{y}-\eta_k) + (\mathbf{y}-\eta_k)^T \nabla_\eta \ell_\eta(\eta_k) + \underbrace{ \frac{\alpha}{2} \nabla_\eta^T \ell_\eta(\eta_k) \mathbf{F}^{-1}_\eta(\eta_k) \nabla_\eta \ell_\eta(\eta_k)}_{\text{constant w.r.t. } y} \big] \\
=& \arg\min_{ {\color{red}y} \in \Omega_\eta } \{ \langle \nabla_\eta \ell_\eta(\eta_k),{\color{red}\mathbf{y}}-\eta_k \rangle + \frac{1}{2\alpha} (\mathbf{y}-\eta_k)^T \mathbf{F}_{\eta}(\eta_k) (\mathbf{y}-\eta_k) \}
\end{aligned}\tag{1}\label{1}
$$
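Below is a small sketch of one projected natural-gradient step, assuming the 1-D Gaussian constraint $\Omega_\eta = \{\eta : \eta_2 < 0\}$ and solving the projection subproblem numerically with SciPy's SLSQP solver (a closed-form projection exists for this half-space constraint; the generic solver is only for illustration).

```python
import numpy as np
from scipy.optimize import minimize

def projected_ngd_step(eta_k, F_eta, grad_eta, alpha, eps=1e-8):
    # One step of (1): take the unconstrained natural-gradient step, then project
    # back onto Omega_eta under the metric F_eta(eta_k).
    # Assumption: Omega_eta = {eta : eta_2 < 0}, enforced numerically as eta_2 <= -eps.
    z = eta_k - alpha * np.linalg.solve(F_eta, grad_eta)   # unconstrained NGD step

    def weighted_dist_sq(y):
        d = y - z
        return d @ F_eta @ d

    res = minimize(weighted_dist_sq, x0=eta_k, method="SLSQP",
                   constraints=[{"type": "ineq", "fun": lambda y: -y[1] - eps}])
    return res.x
```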
This approach is closely related to proximal-gradient descent.
Recall that in Part IV, we showed that natural-gradient descent can be viewed as a proximal-gradient method, where we use the second-order Taylor approximation of any f-divergence $\mathrm{D}_f(\mathbf{y},\eta_k)$ at $\mathbf{y}=\eta_k$:
$$
\begin{aligned}
\mathrm{D}_f(\mathbf{y},\eta_k) \approx \frac{1}{2} (\mathbf{y}-\eta_k)^T \mathbf{F}_{\eta}(\eta_k) (\mathbf{y}-\eta_k)
\end{aligned}
$$
Proximal Gradient Descent
We could also obtain proximal gradient descent by using an f-divergence $\mathrm{D}_f(\mathbf{y},\eta_k)$ without the Taylor approximation.
$$
\begin{aligned}
\eta_{k+1} \leftarrow \arg\min_{ {\color{red}y} \in \Omega_\eta } \{ \langle \nabla_\eta \ell_\eta(\eta_k),{\color{red}\mathbf{y}}-\eta_k \rangle + \frac{1}{\alpha} \mathrm{D}_f( {\color{red} \mathbf{y}},\eta_k) \}
\end{aligned}\tag{2}\label{2}
$$
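Here is a minimal sketch of one proximal-gradient step for the 1-D Gaussian case, with the f-divergence chosen as $\mathrm{D}_f(\mathbf{y},\eta_k) = \mathrm{KL}[p(\mathbf{w}|\eta_k) \| p(\mathbf{w}|\mathbf{y})]$, which is available in closed form; the subproblem is again solved numerically, and the constraint handling mirrors the previous sketch and is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import minimize

def kl_gauss_eta(eta_a, eta_b):
    # KL[ p(w|eta_a) || p(w|eta_b) ] for 1-D Gaussians given in natural parameters.
    (m0, v0), (m1, v1) = [(-0.5 * e[0] / e[1], -0.5 / e[1]) for e in (eta_a, eta_b)]
    return 0.5 * (np.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

def prox_grad_step(eta_k, grad_eta, alpha, eps=1e-8):
    # One step of (2): linearize the loss at eta_k and keep the exact KL term.
    # Assumption: Omega_eta = {eta : eta_2 < 0}, enforced numerically as eta_2 <= -eps.
    def objective(y):
        return grad_eta @ (y - eta_k) + kl_gauss_eta(eta_k, y) / alpha

    res = minimize(objective, x0=eta_k, method="SLSQP",
                   constraints=[{"type": "ineq", "fun": lambda y: -y[1] - eps}])
    return res.x
```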
We have the following additional results when the f-divergence is chosen to be the KL divergence.
- The KL divergence $\mathrm{KL} [p(\mathbf{w}|\eta_k) || p(\mathbf{w}|{\color{red}\mathbf{y}})]=\mathrm{D}_f({\color{red}\mathbf{y}},\eta_k)=\mathrm{B}_{A_\eta}(\eta_k,{\color{red}\mathbf{y}})$ is both an f-divergence and a Bregman divergence. Its second-order Taylor approximation at $\mathbf{y}=\eta_k$ is $$ \begin{aligned} \mathrm{KL} [p(\mathbf{w}|\eta_k) || p(\mathbf{w}|{\color{red}\mathbf{y}})] \approx \frac{1}{2} (\mathbf{y}-\eta_k)^T \mathbf{F}_{\eta}(\eta_k) (\mathbf{y}-\eta_k) \end{aligned} $$
- The second-order Taylor approximation of the reverse KL divergence $\mathrm{KL} [p(\mathbf{w}|{ \color{red} \mathbf{y}} ) || p(\mathbf{w}|\eta_k)]$ at $\mathbf{y}=\eta_k$ gives the same approximation: $$ \begin{aligned} \mathrm{KL} [p(\mathbf{w}|{\color{red}\mathbf{y}}) || p(\mathbf{w}|\eta_k)] \approx \frac{1}{2} (\mathbf{y}-\eta_k)^T \mathbf{F}_{\eta}(\eta_k) (\mathbf{y}-\eta_k) \end{aligned} $$
Note:
The KL divergence is the only divergence [3] that is both an f-divergence and a Bregman divergence.
We have the following identity for the KL divergence.
$$
\begin{aligned}
\mathrm{KL} [p(\mathbf{w}|\eta_k) || p(\mathbf{w}|{\color{red}\mathbf{y}})] =
\mathrm{D}_f({\color{red}\mathbf{y}},\eta_k) = \mathrm{B}_{A_\eta}(\eta_k,{\color{red}\mathbf{y}}) = \mathrm{B}_{A^*_\eta}({\color{red}\mathbf{x}},\mathbf{m}_k)
= \mathrm{KL} [p(\mathbf{w}|\mathbf{m}_k) || p(\mathbf{w}|{\color{red}\mathbf{x}})]
\end{aligned}
$$
where $\mathbf{y}$ is a natural parameter and $\mathbf{x}$ is the corresponding expectation parameter.
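This identity is easy to check numerically for the 1-D Gaussian, whose log-partition function $A_\eta$ and conjugate $A^*_\eta$ (the negative entropy) are available in closed form. The sketch below writes out the primal and dual Bregman forms of the KL explicitly and compares them with the closed-form Gaussian KL; all helper names are local to this sketch.

```python
import numpy as np

def moments(eta):                 # natural parameters -> (mean, variance)
    var = -0.5 / eta[1]
    return eta[0] * var, var

def A(eta):                       # log-partition function of the 1-D Gaussian
    mu, var = moments(eta)
    return 0.5 * mu ** 2 / var + 0.5 * np.log(2 * np.pi * var)

def A_star(m):                    # conjugate of A = negative entropy
    var = m[1] - m[0] ** 2
    return -0.5 * np.log(2 * np.pi * np.e * var)

def to_m(eta):                    # expectation parameters (E[w], E[w^2])
    mu, var = moments(eta)
    return np.array([mu, mu ** 2 + var])

def kl(eta_a, eta_b):             # closed-form Gaussian KL[p_a || p_b]
    (m0, v0), (m1, v1) = moments(eta_a), moments(eta_b)
    return 0.5 * (np.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

eta_k = np.array([1.0, -0.5])     # two arbitrary feasible points (eta2 < 0)
y = np.array([0.3, -2.0])
m_k, x = to_m(eta_k), to_m(y)

primal = A(y) - A(eta_k) - m_k @ (y - eta_k)     # KL via the log-partition A (grad A(eta_k) = m_k)
dual = A_star(m_k) - A_star(x) - y @ (m_k - x)   # KL via the conjugate A*    (grad A*(x) = y)
print(kl(eta_k, y), primal, dual)                # all three values agree
```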
Mirror Descent
Mirror descent in the expectation space remains the same as in Part V.
$$
\begin{aligned}
\mathbf{m}_{k+1} \leftarrow \arg \min_{ {\color{red} x} \in \Omega_m}\{ \langle \nabla_m \ell_m(\mathbf{m}_k), {\color{red}\mathbf{x}}-\mathbf{m}_k \rangle + \frac{1}{\alpha} \mathrm{B}_{A^*_\eta}({\color{red}\mathbf{x}},\mathbf{m}_k) \}
\end{aligned}\tag{3}\label{3}
$$
where $\nabla_m \ell_m(\mathbf{m}_k) = \nabla_m \ell_\eta( \underbrace{ \eta(\mathbf{m}_k)}_{=\eta_k} )= \mathbf{F}_\eta^{-1} (\eta_k) \nabla_\eta \ell_\eta(\eta_k)$.
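This gradient identity can be verified with finite differences. Below is a small sketch assuming the 1-D Gaussian and the quadratic $h(w)=\frac{1}{2}(w-3)^2$ from the earlier sketch, for which the loss has a closed form; all helper names are local to this sketch.

```python
import numpy as np

def moments(eta):                  # natural parameters -> (mean, variance)
    var = -0.5 / eta[1]
    return eta[0] * var, var

def to_m(eta):                     # eta -> expectation parameters m = (E[w], E[w^2])
    mu, var = moments(eta)
    return np.array([mu, mu ** 2 + var])

def to_eta(m):                     # inverse map m -> eta
    mu, var = m[0], m[1] - m[0] ** 2
    return np.array([mu / var, -0.5 / var])

def loss_eta(eta):                 # closed form of the loss for h(w) = 0.5*(w-3)^2 (assumption)
    mu, var = moments(eta)
    return 0.5 * ((mu - 3.0) ** 2 + var) - 0.5 * np.log(2 * np.pi * np.e * var)

def num_grad(f, x, eps=1e-6):      # central finite differences
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

eta_k = np.array([1.0, -0.5])
m_k = to_m(eta_k)

F_eta = np.array([num_grad(lambda e, i=i: to_m(e)[i], eta_k) for i in range(2)])  # F_eta = d m / d eta
grad_m = num_grad(lambda m: loss_eta(to_eta(m)), m_k)                             # nabla_m ell_m(m_k)
nat_grad = np.linalg.solve(F_eta, num_grad(loss_eta, eta_k))                      # F_eta^{-1} nabla_eta ell_eta(eta_k)
print(grad_m, nat_grad)                                                           # the two vectors agree
```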
We could also perform mirror descent in the natural parameter space as
$$
\begin{aligned}
\mathbf{\eta}_{k+1} \leftarrow \arg \min_{ {\color{red}y} \in \Omega_\eta}\{ \langle \nabla_\eta \ell_\eta(\mathbf{\eta}_k), { \color{red}\mathbf{y} }-\mathbf{\eta}_k \rangle + \frac{1}{\alpha} \mathrm{B}_{A_\eta}( { \color{red}\mathbf{y} },\mathbf{\eta}_k) \}
\end{aligned}\tag{4}\label{4}
$$
Note:
Without the Taylor approximation, $\eqref{2}$ and $\eqref{4}$ are distinct updates since the KL divergence is not symmetric.
Adaptive Step-size Selection
Since $\Omega_\eta$ is an open set in $\mathcal{R}^K$, standard natural-gradient descent is still valid when the step-size is small enough.
One idea is to use an adaptive step-size for natural-gradient descent without a projection.
$$
\begin{aligned}
\eta_{k+1} \leftarrow \eta_k - \alpha_k \mathbf{F}_\eta^{-1} (\eta_k) \nabla_\eta \ell_\eta(\eta_k)
\end{aligned}\tag{5}\label{5}
$$
where the step-size $\alpha_k$ is selected so that $\eta_{k+1} \in \Omega_\eta$.
However, for a general parameter constraint $\Omega_\eta$, this approach could result in slow progress of the method.
The step-size selection procedure has to check the constraint at each iteration and could select an extremely small step-size $\alpha_k$.
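A minimal sketch of such a step-size selection is a simple backtracking loop that shrinks $\alpha_k$ until the update stays in $\Omega_\eta$; the membership test `in_domain` is a user-supplied assumption of this sketch.

```python
import numpy as np

def ngd_step_with_backtracking(eta_k, F_eta, grad_eta, in_domain,
                               alpha0=1.0, shrink=0.5, max_tries=50):
    # Update (5) with a crude adaptive step-size: shrink alpha_k until the
    # candidate stays inside Omega_eta. `in_domain` is a user-supplied
    # membership test for Omega_eta (an assumption of this sketch).
    direction = np.linalg.solve(F_eta, grad_eta)
    alpha = alpha0
    for _ in range(max_tries):
        candidate = eta_k - alpha * direction
        if in_domain(candidate):
            return candidate, alpha
        alpha *= shrink           # this can become extremely small near the boundary
    raise RuntimeError("no feasible step-size found")

# Example for the 1-D Gaussian constraint Omega_eta = {eta : eta_2 < 0}:
# eta_next, alpha_k = ngd_step_with_backtracking(eta_k, F_eta, grad_eta, lambda e: e[1] < 0)
```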
Riemannian Gradient Descent
An alternative approach is to use Riemannian gradient descent as we discussed in Part IV, which is a generalization of natural-gradient descent. Note that this approach is completely different from mirror descent.
To avoid solving the geodesic ODE, we could use an approximation of the geodesic, which induces a retraction map.
$$
\begin{aligned}
\eta_{k+1} \leftarrow \mathrm{Ret}_{\eta_k} (- \alpha \mathbf{F}_\eta^{-1} (\eta_k) \nabla_\eta \ell_\eta(\eta_k) )
\end{aligned}\tag{6}\label{6}
$$
As mentioned in Part IV, we have to carefully select a retraction map to handle the parameter constraint.
Given a general parameter constraint $\Omega_\eta$, it can be difficult to come up with an efficient retraction map that satisfies the constraint.
For positive-definite constraints in $\Omega_\eta$, please see [4] for an example of deriving efficient Riemannian gradient updates.
References
[1] H. Salimbeni, S. Eleftheriadis, & J. Hensman, "Natural gradients in practice: Non-conjugate variational inference in Gaussian process models," International Conference on Artificial Intelligence and Statistics (PMLR, 2018), pp. 689–697.
[2] W. Lin, F. Nielsen, M. E. Khan, & M. Schmidt, "Tractable structured natural gradient descent using local parameterizations," International Conference on Machine Learning (ICML) (2021).
[3] S. Amari, Information geometry and its applications (Springer, 2016).
[4] W. Lin, M. Schmidt, & M. E. Khan, "Handling the positive-definite constraint in the Bayesian learning rule," International Conference on Machine Learning (PMLR, 2020), pp. 6116–6126.