lanczos_method

Power Method Recap

Description of the Method

Our goal is to approximate the top right singular vector $\vec{v}_1$ of a matrix $X \in \R^{n\times d}$ with singular value decomposition $X = U\Sigma V^T$. The method is as follows:

Power Method:

Choose $\vec{z}^{(0)}$ randomly, e.g. with entries $\vec{z}^{(0)} \sim \mathcal{N}(0,1)$.
Normalize: $\vec{z}^{(0)} = \vec{z}^{(0)} /\|\vec{z}^{(0)}\|_2$.
For $i = 1,\ldots, q$:
- $\vec{z}^{(i)} = X^T(X\vec{z}^{(i-1)})$
- $n_i = \|\vec{z}^{(i)}\|_2$
- $\vec{z}^{(i)} = \vec{z}^{(i)}/n_i$
Return $\vec{z}^{(q)}$.
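
The steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a tuned implementation: the function name `power_method` and the `rng` argument are assumptions made for this example, and $X^TX$ is never formed explicitly.

```python
import numpy as np

def power_method(X, q, rng=None):
    """Approximate the top right singular vector of X with q power iterations."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    z = rng.standard_normal(d)        # random Gaussian start z^(0)
    z /= np.linalg.norm(z)
    for _ in range(q):
        z = X.T @ (X @ z)             # two matrix-vector products, O(nd) each
        z /= np.linalg.norm(z)        # divide by n_i
    return z
```

Each iteration costs two matrix-vector products, so the total cost is $O(ndq)$.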

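In other words, each iterate $\vec{z}^{(i)}$ is simply a scaling of a column in the following matrix:

K = \begin{bmatrix} \vec{z}^{(0)} & A\vec{z}^{(0)} & A^2 \vec{z}^{(0)} & A^3\vec{z}^{(0)} & \ldots & A^q \vec{z}^{(0)}\end{bmatrix},

where $A = X^TX$. Typically we run $q \ll d$ iterations, so $K$ is a tall, narrow matrix. The span of the columns of $K$ is called a Krylov subspace: it is generated by starting from a single vector $\vec{z}^{(0)}$ and repeatedly multiplying by a fixed matrix $A$. We will return to $K$ shortly.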
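
To make the connection between the iterates and $K$ concrete, here is a small sketch that builds the Krylov matrix column by column. The helper name `krylov_matrix` is hypothetical, and each column is rescaled to unit norm (an assumption made for numerical stability), which changes neither the span nor the fact that the last column is the power method iterate.

```python
import numpy as np

def krylov_matrix(X, z0, q):
    """Return a d x (q+1) matrix whose columns are unit-norm scalings of A^i z0."""
    cols = [z0 / np.linalg.norm(z0)]
    for _ in range(q):
        v = X.T @ (X @ cols[-1])          # multiply by A = X^T X
        cols.append(v / np.linalg.norm(v))
    return np.column_stack(cols)
```

With this scaling, the last column of the returned matrix coincides with $\vec{z}^{(q)}$ from the power method started at the same $\vec{z}^{(0)}$.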

Analysis Preliminaries

Let $\vec{v}_1, \ldots, \vec{v}_d$ be the right singular vectors of $X$ (the columns of $V$). We write each iterate in terms of this basis of vectors:

\begin{align*} \vec{z}^{(0)} &= c_1^{(0)}\vec{v}_1 + {c}_2^{(0)}\vec{v}_2 + \ldots + c_d^{(0)} \vec{v}_d \\ \vec{z}^{(1)} &= c_1^{(1)}\vec{v}_1 + {c}_2^{(1)}\vec{v}_2 + \ldots + c_d^{(1)} \vec{v}_d\\ &\vdots\\ \vec{z}^{(q)} &= c_1^{(q)}\vec{v}_1 + {c}_2^{(q)}\vec{v}_2 + \ldots + c_d^{(q)} \vec{v}_d \end{align*}

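Here $c_j^{(i)} = \langle \vec{z}^{(i)}, \vec{v}_j\rangle$. Our goal is to show that $c_1^{(i)}$ grows relative to the other coefficients as we increase $i$, so that $\vec{z}^{(q)}$ ends up closely aligned with $\vec{v}_1$. To see why this should happen, note that $\vec{z}^{(i)} = \frac{1}{n_i}A \vec{z}^{(i-1)}$, so $c_j^{(i)} = \frac{1}{n_i}\sigma_j^2 c^{(i-1)}_j$. The factor $\frac{1}{n_i}$ is the same for every $j$, while $\sigma_j^2$ is largest for $j=1$, so at each step $c_1^{(i)}$ will increase in size more than any other term. We will analyze this formally soon.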

First however, let's see what it suffices to prove.

Claim 1. If $\left|c_j^{(q)}/c_1^{(q)}\right| \leq \sqrt{\epsilon/d}$ for all $j\neq 1$, then either $\|\vec{v}_1 -\vec{z}^{(q)}\|_2^2\leq 2\epsilon$ or $\|-\vec{v}_1 - \vec{z}^{(q)}\|_2^2\leq 2\epsilon$.

Proof. Since $\left|c_1^{(q)}\right| \leq 1$, the assumption implies that $\left|c_j^{(q)}\right| \leq \sqrt{\epsilon/d}$ for all $j \neq 1$. Because $\vec{z}^{(q)}$ is a unit vector, $\sum_{k=1}^d \left(c_k^{(q)}\right)^2 = 1$, so $\left(c_1^{(q)}\right)^2 \geq (1-\epsilon)$ and thus $\left|c_1^{(q)}\right| \geq 1-\epsilon$. Next, for any unit vector $\vec{x}$ we have $\|\vec{x} -\vec{z}^{(q)}\|_2^2 = 2 - 2\langle\vec{x},\vec{z}^{(q)}\rangle$. Since $\langle\vec{v}_1,\vec{z}^{(q)}\rangle = c_1^{(q)}$, either $\langle\vec{v}_1,\vec{z}^{(q)}\rangle \geq 1-\epsilon$ or $\langle- \vec{v}_1,\vec{z}^{(q)}\rangle \geq 1-\epsilon$. Assume without loss of generality that $\langle\vec{v}_1,\vec{z}^{(q)}\rangle \geq 1-\epsilon$. Then $\|\vec{v}_1 -\vec{z}^{(q)}\|_2^2 = 2 - 2\langle\vec{v}_1,\vec{z}^{(q)}\rangle \leq 2 - 2\cdot (1-\epsilon) = 2\epsilon$. $\square$

So, to analyze the power method, it suffices to show that $\left|c_j^{(q)}/c_1^{(q)}\right| \leq \sqrt{\epsilon/d}$ for all $j\neq 1$.

Heart of the Analysis

Unrolling the recurrence above, we have $c_j^{(q)} = S\cdot \sigma_j^{2q} c_j^{(0)}$ for every $j$, and in particular $c_1^{(q)} = S\cdot \sigma_1^{2q} c_1^{(0)}$, where $S = 1/\prod_{i=1}^q n_i$ is some fixed scaling. So:

\left|\frac{c_j^{(q)} }{c_1^{(q)} }\right| = (\sigma_j/\sigma_1)^{2q} \left|\frac{c_j^{(0)} }{c_1^{(0)} }\right|.
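
The identity above is easy to check numerically. The sketch below is only an illustration: the helper name `coefficient_ratios` is hypothetical, and it assumes $n \geq d$ so that `np.linalg.svd` returns all $d$ right singular vectors.

```python
import numpy as np

def coefficient_ratios(X, z0, q):
    """Compare |c_j^(q) / c_1^(q)| with (sigma_j / sigma_1)^(2q) |c_j^(0) / c_1^(0)|."""
    _, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    z = z0 / np.linalg.norm(z0)
    c0 = Vt @ z                           # coefficients of z^(0) in the basis v_1..v_d
    for _ in range(q):
        z = X.T @ (X @ z)
        z /= np.linalg.norm(z)
    cq = Vt @ z                           # coefficients of z^(q)
    observed = np.abs(cq[1:] / cq[0])
    predicted = (sigma[1:] / sigma[0]) ** (2 * q) * np.abs(c0[1:] / c0[0])
    return observed, predicted
```

For a random `X` and `z0`, the two returned arrays should agree up to floating point error, since the scaling $S$ cancels in the ratio.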

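Since $z^{(0)}$ is a random Gaussian vector, one can show that with high probability $\left|\frac{c_j^{(0)} }{c_1^{(0)} }\right| \leq d^{3}$ for all $j$. That may look like a large factor, but $(\sigma_j/\sigma_1)^{2q}$ is going to be a tiny number, so it will easily cancel that out. In particular,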

(\sigma_j/\sigma_1)^{2q} = \left(1-\frac{\sigma_1-\sigma_j}{\sigma_1}\right)^{2q} \leq (1-\gamma)^{2q},

where $\gamma = \frac{\sigma_1-\sigma_2}{\sigma_1}$ is our spectral gap parameter. As long as we set

q = \frac{\log(d^3\sqrt{d/\epsilon})}{\gamma} = O\left(\frac{\log(d/\epsilon)}{\gamma}\right),

we have $(1-\gamma)^{2q} \leq \frac{\sqrt{\epsilon/d}}{d^3}$, and thus

\left|\frac{c_j^{(q)} }{c_1^{(q)} }\right| = (\sigma_j/\sigma_1)^{2q} \left|\frac{c_j^{(0)} }{c_1^{(0)} }\right| \leq d^3\cdot \frac{\sqrt{\epsilon/d}}{d^3} \leq \sqrt{\epsilon/d},

as desired.
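
As a quick sanity check of this choice of $q$, the following sketch computes the iteration count from the formula and verifies the required inequality for one set of parameters. The function name and the specific values of $d$, $\epsilon$, and $\gamma$ are assumptions chosen for illustration, and the logarithm is taken to be the natural log.

```python
import math

def power_method_iterations(d, eps, gamma):
    """Iteration count q = log(d^3 * sqrt(d / eps)) / gamma, rounded up."""
    return math.ceil(math.log(d ** 3 * math.sqrt(d / eps)) / gamma)

d, eps, gamma = 100, 0.01, 0.1
q = power_method_iterations(d, eps, gamma)
target = math.sqrt(eps / d) / d ** 3       # the bound we need on (1 - gamma)^(2q)
print(q, (1 - gamma) ** (2 * q) <= target)
```

The inequality holds because $(1-\gamma)^{2q} \leq e^{-2\gamma q}$.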

Alternative Guarantee

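In many applications we care less about approximating $\vec{v}_1$ itself and more about finding a unit vector $\vec{z}$ that gives a good rank-1 approximation of $X$. The approximation error is $\|X - X\vec{z}\vec{z}^T\|_F^2 = \|X\|_F^2 - \|X\vec{z}\vec{z}^T\|_F^2$, so minimizing it is the same as maximizing $\|X\vec{z}\vec{z}^T\|_F^2$. The best possible value is $\|X\vec{v}_1\vec{v}_1^T\|_F^2 = \sigma_1^2$. If either $\langle\vec{v}_1,\vec{z}^{(q)}\rangle \geq 1-\epsilon$ or $\langle- \vec{v}_1,\vec{z}^{(q)}\rangle \geq 1-\epsilon$, then $\|X\vec{z}\vec{z}^T\|_F^2 \geq (1-\epsilon)^2 \sigma_1^2$ for $\vec{z} = \vec{z}^{(q)}$. So after $O\left(\frac{\log(d/\epsilon)}{\gamma}\right)$ iterations we get a near optimal low-rank approximation.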
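
The error identity used here can be verified directly. This is a minimal sketch with a hypothetical helper `rank_one_error`; the input vector is normalized inside the function because the identity relies on $\vec{z}$ having unit norm.

```python
import numpy as np

def rank_one_error(X, z):
    """Return ||X - X z z^T||_F^2 for the unit vector in the direction of z."""
    z = z / np.linalg.norm(z)
    return np.linalg.norm(X - np.outer(X @ z, z), "fro") ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 8))
z = rng.standard_normal(8)
z /= np.linalg.norm(z)

lhs = rank_one_error(X, z)
rhs = np.linalg.norm(X, "fro") ** 2 - np.linalg.norm(X @ z) ** 2
print(np.isclose(lhs, rhs))
```

Note that $\|X\vec{z}\vec{z}^T\|_F^2 = \|X\vec{z}\|_2^2$ for a unit vector $\vec{z}$, which is what the right-hand side uses.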

The Lanczos Method

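We will now see how to improve on power method using what is known as the Lanczos method. Like power method, Lanczos is considered a Krylov subspace method because it returns a vector in the span of the Krylov subspace $K$. By using the whole subspace rather than only its last column, it achieves an iteration complexity that depends on $1/\sqrt{\gamma}$ instead of $1/\gamma$.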

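Let $Q\in \R^{d\times k}$ be a matrix with orthonormal columns that span the columns of $K$. In other words, $Q$ is an orthonormal basis for the Krylov subspace generated by $K$.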

Lanczos Method

Compute an orthonormal basis $Q$ for the degree $q$ Krylov subspace.
Let $\vec{z}$ be the top eigenvector of $Q^TA{Q} = {Q}^T{X}^T{X}{Q}$.
Return $Q\vec{z}$.

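Importantly, computing $Q$ only requires $q$ matrix-vector multiplications with $X^TX$, each of which takes $O(nd)$ time if we never form ${X}^TX$ explicitly. The second step requires the top eigenvector of $Q^TA{Q}$, but this matrix is only about $q\times q$. Since $q\ll d$, it is much smaller than $X^TX$, so working with ${Q}^T{X}^T{X}{Q}$ is cheap. The total runtime is $O(ndq) + O(dq^2) + O(q^3) + O(ndq)$ (the first two terms to build and orthonormalize the Krylov vectors into $Q$, the last to compute $XQ$ and then ${Q}^T{X}^T{X}{Q}$, and the third to find its top eigenvector using a direct method).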
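
The three steps can be sketched in NumPy as below. This follows the description in these notes literally, using an explicit QR factorization of the Krylov matrix; it is not the three-term Lanczos recurrence used by production libraries. The function name is hypothetical, and the sketch assumes $q < d$.

```python
import numpy as np

def lanczos_top_singular_vector(X, q, rng=None):
    """Best approximate top right singular vector of X within a Krylov subspace."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    v = rng.standard_normal(d)
    cols = [v / np.linalg.norm(v)]
    for _ in range(q):                    # q multiplications by A = X^T X
        v = X.T @ (X @ cols[-1])
        cols.append(v / np.linalg.norm(v))
    K = np.column_stack(cols)             # d x (q+1), columns rescaled to unit norm
    Q, _ = np.linalg.qr(K)                # orthonormal basis for the span of K
    XQ = X @ Q
    T = XQ.T @ XQ                         # small matrix Q^T X^T X Q
    _, vecs = np.linalg.eigh(T)           # eigenvalues are returned in ascending order
    z = vecs[:, -1]                       # top eigenvector of T
    return Q @ z
```

Because `Q` has orthonormal columns and `z` is a unit vector, the returned vector has unit norm.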

Analysis Preliminaries

Our first claim is that Lanczos returns the best approximate singular vector in the span of the Krylov subspace. Then we will argue that there always exists some vector in the span of the subspace that is significantly better than what power method returns, so the Lanczos solution must be significantly better as well.

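Claim 2. Amongst all unit vectors $\vec{y}$ that can be written as $\vec{y} = Q\vec{x}$ for some $x\in \R^k$, the Lanczos output $\vec{y}^* = Q\vec{z}$ minimizes the low-rank approximation error $\|X - X\vec{y}\vec{y}^T\|_F^2$.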

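Proof. For any unit vector $\vec{y}$, the matrix $\vec{y}\vec{y}^T$ is a rank-1 projection, so minimizing $\|X - X\vec{y}\vec{y}^T\|_F^2$ is equivalent to maximizing $\|X\vec{y}\vec{y}^T\|_F^2$. If $\vec{y} = Q\vec{x}$, then $\vec{x} = Q^T\vec{y}$ is also a unit vector because $Q$ has orthonormal columns. To see that $\vec{y}^* = {Q}\vec{z}$ is optimal, note that $\|X\vec{y}\vec{y}^T\|_F^2 = \|X\vec{y}\|_F^2 = \|X\vec{y}\|_2^2 = \|XQ\vec{x}\|_2^2$. Over unit vectors $\vec{x}$, this quantity is maximized by the top right singular vector of ${X}{Q}$, and the maximizer of $\|XQ\vec{x}\|_2^2$ is exactly the top eigenvector of $Q^TX^TXQ$. $\square$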
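
Claim 2 can be illustrated numerically. The proof only uses the fact that $Q$ has orthonormal columns, so the sketch below assumes a random orthonormal `Q` rather than a Krylov basis, and compares the eigenvector solution against many random unit vectors in the span of `Q`. All sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 12))
Q, _ = np.linalg.qr(rng.standard_normal((12, 4)))   # any orthonormal basis works

XQ = X @ Q
_, vecs = np.linalg.eigh(XQ.T @ XQ)
best = np.linalg.norm(XQ @ vecs[:, -1]) ** 2        # ||X Q z||^2 for the top eigenvector z

x = rng.standard_normal((4, 1000))
x /= np.linalg.norm(x, axis=0)                      # 1000 random unit vectors x
others = np.linalg.norm(XQ @ x, axis=0) ** 2        # ||X Q x||^2 for each of them
print(best >= others.max())
```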

Claim 3. If we run the Lanczos method for $O\left(\frac{\log(d/\epsilon)}{\sqrt{\gamma}}\right)$ iterations, there is some unit vector $\vec{w}$ of the form $\vec{w} = Q\vec{x}$ such that either $\langle\vec{v}_1,\vec{w}\rangle \geq 1-\epsilon$ or $\langle- \vec{v}_1,\vec{w}\rangle \geq 1-\epsilon$.

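In other words, there is some $\vec{w}$ in the Krylov subspace that is very close to $\vec{v}_1$ and therefore has a large value of $\|X\vec{w}\vec{w}^T\|_F^2$. By Claim 2, the vector $\vec{v}$ returned by Lanczos can only do better. So, we focus on proving Claim 3.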

Heart of the Analysis

When Lanczos is run for $q$ iterations, any vector $\vec{w}$ that can be written as $\vec{w} = Q\vec{x}$ is equal to:

\vec{w} = p(A)\vec{z}^{(0)},

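for some degree $q$ polynomial $p$. For example, we might have $p(A) = 2A^2 - 4A^3 + A^6$ or $p(A) = I - A - 10A^5$. Conversely, for any polynomial $p$ of degree at most $q$, there is some $\vec{x}$ such that $Q\vec{x} = p(A)\vec{z}^{(0)}$. This is because all of the vectors $\vec{z}^{(0)}, A\vec{z}^{(0)}, \ldots, A^q\vec{z}^{(0)}$ lie in the span of $Q$, so any linear combination does as well.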
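
The claim that $p(A)\vec{z}^{(0)}$ lies in the span of $Q$ can be checked for the first example polynomial. In this sketch the matrix sizes are arbitrary, $X$ is rescaled so that $\sigma_1 = 1$, and $A$ is formed explicitly only because the example is tiny; all of these are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((25, 12))
X /= np.linalg.norm(X, 2)                 # rescale so that sigma_1 = 1
A = X.T @ X
z0 = rng.standard_normal(12)
z0 /= np.linalg.norm(z0)

q = 6
cols = [z0]
for _ in range(q):
    v = A @ cols[-1]
    cols.append(v / np.linalg.norm(v))
Q, _ = np.linalg.qr(np.column_stack(cols))   # orthonormal basis of the Krylov subspace

mp = np.linalg.matrix_power
w = 2 * mp(A, 2) @ z0 - 4 * mp(A, 3) @ z0 + mp(A, 6) @ z0   # p(A) z0 with p = 2t^2 - 4t^3 + t^6
residual = np.linalg.norm(w - Q @ (Q.T @ w)) / np.linalg.norm(w)
print(residual)
```

The printed residual measures how much of `w` falls outside the span of `Q`, relative to the size of `w`; it should be at the level of floating point round-off.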

So, to find a good $\vec{w}$, it suffices to find a good polynomial $p(A)$. Writing $p(A)\vec{z}^{(0)}$ in the span of the singular vectors we have:

p(A)\vec{z}^{(0)} = g_1 \vec{v}_1 + g_2 \vec{v}_2 + \ldots + g_d \vec{v}_d

where

g_j = c_j^{(0)}p(\sigma_j^2).

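Our goal is to make $g_1$ much larger than $g_j$ for all $j \neq 1$. To do so, we want a polynomial $p(t)$ that is small for all $0\leq t< \sigma_1^2$ but then has a sharp jump at $\sigma_1^2$. One possible choice of degree $q$ is $p(t) = t^q$, which corresponds to the power method. However, it turns out there are more sharply jumping polynomials, which can be obtained by shifting and scaling a Chebyshev polynomial. Assuming for simplicity that $\sigma_1^2 = 1$, compared to $t^q$ such a polynomial $p$ can be much smaller on most of the range $0\leq t< 1$ while taking the same value at $t=1$.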

Concretely we can claim the following, which is a bit tricky to prove, but well known (see Lemma 5 here for a full proof).

Claim 4: There is a degree $O\left(\sqrt{\frac{1}{\gamma}}\log\frac{1}{\epsilon}\right)$ polynomial $\hat{p}$ such that $\hat{p}(1) = 1$ and $|\hat{p}(t)| \leq \epsilon$ for all $0 \leq t \leq 1-\gamma$.

In contrast, for $t^q$ to satisfy the same guarantee we would need degree $O\left({\frac{1}{\gamma}\log\frac{1}{\epsilon}}\right)$, a quadratically worse bound. This is what will account for the quadratic difference in performance between the Lanczos and Power Methods.
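
The gap between the two polynomials is easy to see numerically. The sketch below uses one standard construction, a Chebyshev polynomial $T_q$ shifted so that $[0, 1-\gamma]$ maps onto $[-1, 1]$ and rescaled so that the value at $t=1$ is $1$. This is an assumption about how such a $\hat{p}$ can be built, not necessarily the construction in the referenced proof, and the values of $\gamma$ and $q$ are arbitrary.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

def jump_polynomial(t, q, gamma):
    """Degree-q polynomial with value 1 at t = 1 that is small on [0, 1 - gamma]."""
    coeffs = np.zeros(q + 1)
    coeffs[q] = 1.0                                   # selects T_q in the Chebyshev basis

    def shift(u):
        return 2.0 * u / (1.0 - gamma) - 1.0          # maps [0, 1 - gamma] onto [-1, 1]

    return chebval(shift(np.asarray(t, dtype=float)), coeffs) / chebval(shift(1.0), coeffs)

gamma, q = 0.01, 50
t = np.linspace(0.0, 1.0 - gamma, 2001)
cheb_max = np.abs(jump_polynomial(t, q, gamma)).max()  # largest |p_hat(t)| on [0, 1 - gamma]
power_max = (t ** q).max()                             # largest t^q on the same range
print(cheb_max, power_max)
```

With the same degree, the Chebyshev-based polynomial is orders of magnitude smaller on $[0, 1-\gamma]$ than $t^q$, which is still above $1/2$ at $t = 1-\gamma$ for these parameters.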

Finding polynomials that jump more steeply than $t^q$ is the sort of thing studied in the mathematical field known as Approximation Theory. That might seem pretty obscure, but steep polynomials are surprisingly useful in computer science, appearing everywhere from classical to quantum complexity theory, to learning theory. If you're interested in learning more about this, check out slides for this talk.

Finishing Up

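Finally, we use Claim 4 to prove Claim 3. Consider the vector $\hat{p}\left(\frac{1}{\sigma_1^2}A\right)\vec{z}^{(0)}$, which as argued above lies in the Krylov subspace. As discussed after the expansion of $p(A)\vec{z}^{(0)}$ in the singular vectors, our job is to prove that: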

\frac{c_j^{(0)}\hat{p}\left(\frac{1}{\sigma_1^2}\sigma_j^2\right)}{c_1^{(0)}\hat{p}\left(\frac{1}{\sigma_1^2}\sigma_1^2\right)} \leq \sqrt{\epsilon/d},

for all $j \neq 1$, as long as $\hat{p}$ has degree $q = O\left(\sqrt{\frac{1}{\gamma}}\log\frac{d}{\epsilon}\right)$.

To see this, first note that for any $j \neq 1$, $\frac{\sigma_j^2}{\sigma_1^2} = \left(\frac{\sigma_j}{\sigma_1}\right)^2 = \left(1 - \left(1-\frac{\sigma_j}{\sigma_1}\right)\right)^2 = \left(1 - \frac{\sigma_1 - \sigma_j}{\sigma_1}\right)^2 \leq (1-\gamma)^2 \leq (1-\gamma)$.

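So, if we choose $\hat{p}$ with degree $q = O\left(\sqrt{\frac{1}{\gamma}}\log\frac{1}{\epsilon'}\right)$ where $\epsilon' = \sqrt{\epsilon/d}/d^3$, Claim 4 guarantees that $\hat{p}\left(\frac{1}{\sigma_1^2}\sigma_j^2\right) \leq \epsilon'$ for all $j \neq 1$, while the denominator equals $c_1^{(0)}\hat{p}(1) = c_1^{(0)}\cdot 1$. Since $c_j^{(0)}/c_1^{(0)}\leq d^3$ with high probability, the ratio is at most $d^3\epsilon' = \sqrt{\epsilon/d}$. Note that $\log\frac{1}{\epsilon'} = O\left(\log\frac{d}{\epsilon}\right)$, so this proves Claim 3 with $\vec{w} = \hat{p}\left(\frac{1}{\sigma_1^2}A\right)\vec{z}^{(0)}$, rescaled to unit norm.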