Introduction to residual correction
Two ways of finding $w$ such that $f(w)=b$:
- solve $f(w)=b$ by using approximate inverse $g\approx f^{-1}$ on current residual $f(w)-b$ to update $w$
- minimize $\|f(w)-b\|^2$ using gradient descent in transformed coordinates $u=T w$
The former approach is called the method of "residual correction" in scientific literature. The latter view is common in Machine Learning. These two views suggest following two ways to design a new iterative method :
- Craft an approximate inverse $g\approx f^{-1}$
- Come up with a coordinate transformation $T$
Often these views are interchangeable although "residual correction" view is more general.
Taking $f(w)=Aw$ we get Newton's method if we:
- apply residual correction with $g(u)=A^\dagger u$
- perform gradient descent with $T=(A^TA)^{-1/2}$.
Residual correction algorithm
Suppose we seek $w$ such that $f(w)=b$. Start with a random $w$ and apply the following update:
$$\begin{equation} \label{iterate} w=w-g(f(w)-b) \end{equation}$$Here, $g$ is the inverse of $f$ and this iteration converges in a single step. Make it cheaper by using an approximate inverse $g$ with $g(f(w))\approx w$. Approximation errors means we might not get the answer in a single step, but under some conditions iterating Eq $\ref{iterate}$ will shrink the error to 0.
Chatelin, in Section 2.12 calls $g$ "an approximate local inverse". A necessary condition for $\ref{iterate}$ to converge is that $I-g\circ f$ is a stable map. A sufficient condition is that it is contractive. Generally speaking, $g(f(w))$ must stay close to $w$. It converges if there's some norm $\|\cdot\|$ such that the following holds for all relevant $w$:
$$\|g(f(w))-w\|<\|w\|$$
Will can get $g$ by locally approximating $f$ with a linear function $f(w)\approx Aw$ and inverting that. Consider linear $f(w)=b$ corresponding to the following linear system:
$$ \begin{equation} \label{original} Aw=b \end{equation} $$We can solve this in a single step of residual correction by using $g(y)=A^\dagger y$
$$ \begin{equation} \label{newton} w=w-A^\dagger(Aw-b) \end{equation} $$For a greater family of algorithms, the following transformation is useful. Using properties of pseudo-inverse, rewrite Eq $\ref{newton}$
\begin{equation} \label{right} w=w-(A^T A)^\dagger A^T (Aw-b) \end{equation} \begin{equation} \label{left} w=w-A^T (A A^T)^\dagger (Aw-b) \end{equation}When $A^T A$ is invertible, $(A^T A)^\dagger = (A^T A)^{-1}$ and iteration \ref{right} corresponds to Newton's method.
When $A A^T$ is invertible, $(A AT)\dagger = (A AT){-1}$ and iteration \ref{left} corresponds to Affine Projection.
Affine Projection term is used in signal processing literature whereas linear algebra papers call it block Kaczmarz.
We can approximate $A^TA$ or $A A^T$ with a diagonal matrix for a cheap approximation of $A^\dagger$. In this case we get the following updates:
\begin{equation}
\label{left:diag}
w=w-(\text{diag}(A^T A))^{-1} A^T (Aw-b)
\end{equation}
\begin{equation}
\label{right:diag}
w=w-A^T (\text{diag}(A AT)){-1} (Aw-b)
\end{equation}
These two choices of approximate inverse correspond to gradient descent with feature normalization or example normalization respectively. The former can be viewed as Jacobi iteration. The latter can be seen as Kaczmarz update applied to every example in parallel. Kaczmarz is also known as normalized LMS.
Even cheaper is to approximate $A^\dagger\approx \mu A^T$ which gives standard gradient descent iteration on least squares objective over $\ref{original}$
\begin{equation}\label{gd}w=w-\mu A^T (Aw-b)\end{equation}
Using "approximate inverse" analogy, this choice means that we approximate true inverse $A^\dagger$ with $\mu A^T$. When $A$ is a unitary matrix, $A^\dagger=A^T$ so we can set $\mu=1$.
In high dimensions, up to $k$ random vectors $x$ will be mutually orthogonal with high probability, so with row normalization, $A^T\approx A^\dagger$. This means that on a random row-normalized problem, gradient descent converges like Newton's method if the number of rows is less than $k$.
It can be shown that $k$ is proportional to $R$, the "intrinsic dimension" of $\Sigma$, the covariance of $x$.
$$R=\frac{\text{Tr}(\Sigma)}{\|\Sigma\|}$$This quantity provides
- largest batch size for which gradient descent (GD) is effective
- smallest batch size for which Newton's update improves upon GD
Iterative linear solvers
Linear iterative solvers typically deal with square $A$, we transform our problem as follows to guarantee "squareness" of $A$.
$$A^{T}Aw=A^Tb$$Applying iteration $\ref{iterate}$ with $g(w)=\mu w$ to squared system gives the following update, known as the modified Richardson iteration.
$$w=w-\mu A^{T}(Aw-b)$$
For a general family of methods, consider iterating the following equation with $g(y)=Py$ used for approximate inverse of $f(w)=Aw$
$$\begin{equation}\label{pre}w=w - P (Aw-b)\end{equation}$$
Here $P$ is our "approximate inverse" or "preconditioner". Plugging various choices of $P$ generates some named algorithms. Let $P=$
$$\begin{align} \text{gradient descent} & : A^T\\ \text{sign-GD} & : \text{sign}(A^T)\\ \text{sketched GD} & : A^T S \\ \text{Newton} & : A^\dagger=(A^TA)^\dagger A^T = A^T (AA^T)^\dagger\\ \text{KFAC-Newton} & : A_1^\dagger \otimes A_2^\dagger, A\approx A_1\otimes A_2\\ \text{Affine Projection (AP)} & : A^T(AA^T)^{-1} \\ \text{Mahalanobis AP} & : B^{-1} A^T(AB^{-1}A^T)^{-1} \\ \text{Gower/Richtarik} & : B^{-1} A^T S (S^T AB^{-1}A^TS)^\dagger S^T\\ \text{Jacobi} & : D^{-1}A^T\\ \text{Gauss-Seidel} & : (D+L)^{-1} A^T \\ \text{Newton} & : (D+L+U)^{-1}A^T\\ \text{SOR} & : (D/\omega+L)^{-1} A^T, 0<\omega<2, \omega \rightarrow 2\\ \text{richardson, left-pre} & : a^tl^t l\\ right-pre} rr^t a^t\\ \text{newton + left pre} (a^tl^t l a)^{-1}a^t l^t \text{olse} (a^t a)^{-1} a^t \\ \text{glse} w^{-1}a)^{-1} w^{-1} \end{align}$$