Last class, we saw the power of linearity of expectation, Markov’s inequality, and the union bound. This class, we will learn and use linearity of variance and Chebyshev’s inequality.

**Lemma:** Let \(X\) be
a random variable expectation \(\mathbb{E}[X]\) and variance \(\sigma^2 = \textrm{Var}[X]\). Then for any
\(k > 0\),

\[ \Pr(|X - \mathbb{E}[X]| > k \sigma) \leq \frac{1}{k^2}. \]

When the variance is smaller, we expect \(X\) to concentrate more closely around its expectation. Chebyshev’s inequality is one way to formalize this intuition.

There are two main benefits of Chebyshev’s over Markov’s :

Chebyshev’s applies to any random variable \(X\). In contrast, Markov’s requires that the random variable \(X\) is non-negative.

Chebyshev’s gives a two-sided bound: we bound the probability that \(|X - \mathbb{E}[X]|\) is large, which means that \(X\) isn’t too far above

*or*below its expectation. In contrast, Markov’s only bounds the probability that \(X\) is larger than \(\mathbb{E}[X]\).

While Chebyshev’s gives a more general bound, it also needs a bound on the variance of \(X\). In general, bounding the variance is harder than bounding the expectation.

Note: There’s no hard rule for which to apply! Both Markov’s and Chebyshev’s are useful in different settings.

**Proof of Chebyshev’s Inequality:** We’ll apply
Markov’s inequality to the non-negative random variable \(S = (X - \mathbb{E}[X])^2\). By Markov’s,
\[
\Pr\left(S \geq t \right) \leq \frac{\mathbb{E}[S]}{t}.
\] Observe that \(\mathbb{E}[S] =
\mathbb{E}[(X-\mathbb{E}[X])^2] = \textrm{Var}(X)\), plug in
\(t = k^2 \sigma^2\), and use the
definition of \(S\). Then

\[ \Pr \left((X - \mathbb{E}[X])^2 \geq k^2 \sigma^2\right) \leq \frac{\textrm{Var}(X)}{k^2 \sigma^2}. \] We’ll take the square root of the event inside the probability (this preserves the inequality) and observe that \(\textrm{Var}(X) = \sigma^2\). Then simplifying

\[ \Pr\left(|X - \mathbb{E}[X]| \geq k \sigma\right) \leq \frac{1}{k^2}. \]

In order to use Chebyshev’s inequality, we need to bound the variance. Luckily, there’s a helpful tool to help us compute the variance when we have a sum of independent random variables.

**Fact:** For *pairwise independent* random
variables \(X_1, \ldots, X_m\),

\[ \textrm{Var}[X_1 + X_2 + \ldots + X_m] = \textrm{Var}[X_1] + \textrm{Var}[X_2] + \ldots + \textrm{Var}[X_m]. \]

Notice that we require pairwise independence so for any \(i, j \in \{1, \ldots, m\}\), \(X_i\) and \(X_j\) are independent. Pairwise independence is a strictly weaker requirement than \(k\)-wise independence (when \(k>2\)) which requires that for all \(k\) variables \(X_1, \ldots, X_k\) and all values \(v_1, \ldots, v_k\), \[ \Pr(X_1 = v_1, \ldots, X_k=v_k) = \Pr(X_1=v_1) \cdot \ldots \cdot \Pr(X_k = v_k). \]

**Question:** Can you think of three random variables
that are pairwise independent but not 3-wise independent?

Here’s one answer: Let \(X_1\) be a random variable that is equally likely to be \(1\) and \(-1\), let \(X_2\) be another random variable that is equally to be \(1\) and \(-1\), and let \(X_3 = X_1 X_2\). With only two of the three, the random variables are independent. But, with all three of them, the third random variable is strongly correlated with the first two.

In order to gain familiarity with Chebyshev’s inequality and linearity of variance, let’s go through an example with coin flips. Let \(C_1, \ldots, C_{100}\) be independent random variables that are \(1\) with probability \(\frac{1}{2}\) and \(0\) otherwise.

Let the number of heads \(H\) be \(\sum_{i=1}^{100} C_i\). By linearity of expectation, we know that \[ \mathbb{E}[H] = \sum_{i=1}^{100} \mathbb{E}[C_i] = \sum_{i=1}^{100} \frac{1}{2}=50. \] Since \(C_i\) is an indicator random variable, we know \[ \textrm{Var}(C_i) = \mathbb{E}[C_i^2] - \mathbb{E}[C_i]^2 = \frac{1}{2} - \left(\frac{1}{2}\right)^2 = \frac{1}{4}. \] Then, by linearity of variance (all the flips are independent), we know that \[ \textrm{Var}(H) = \sum_{i=1}^{100} \textrm{Var}(C_i) = \sum_{i=1}^{100} \frac{1}{4} = 25. \]

Now let’s apply Chebyshev’s inequality. We know \(\mathbb{E}[H] = 50\) and \(\textrm{Var}(H) = \sigma^2 =25\). \[ \Pr(|H - 50| \geq 5 \cdot 4 ) \leq \frac{1}{4^2} = \frac{1}{16}. \] So, with at least \(93\%\) chance, we get between 30 and 70 heads.

The first problem that we’ll consider is estimating distinct elements. The problem, like many we’ll see in this class (e.g. frequent items estimation), is posed in the streaming model. In this model, we have a massive data set that arrives in a sequential stream. There is far too much data to store or process in a single location but we still want to analyze the data. To do this, we must compress the data on the fly, storing some smaller data structure which still contains interesting information.

The input to the distinct elements problem is \(x_1, \ldots, x_n\) where each item is in a
huge universe of items \(\mathcal{U}\).
The output is the number of *distinct* inputs \(D\). For example, if the input is \(1, 10, 2, 4, 9, 2, 10, 4\) then the output
is \(D=5\).

The distinct elements problem has many applications: Distinct users visiting a webpage. Distinct values in a database columns. Distinct queries to a search engine. Distinct motifs in a DNA sequence. Because the problem is so general, implementations of the algorithm we’ll describe today (and several refinements of it) are deployed in practice at companies like Google, Yahoo, Twitter, Facebook, and many more.

The naive approach to solve the problem is to build a dictionary of all items seen so far. Every time we see an element, we hash it and check if we already have it in the dictionary. Unfortunately, the space complexity is \(O(D)\) and, if we have millions or billions of distinct elements, this naive approach is infeasible.

Our goal is to return an estimate \(\hat{D}\) that satisfies \[ (1-\epsilon) D \leq \hat{D} \leq (1+\epsilon) D \] but only uses \(O(1/\epsilon^2)\) space. This should be surprising: We can use space that is (basically) independent of the number of distinct elements!

The algorithm we’ll use to accomplish this is surprisingly simple. Choose a random hash function \(h: U \rightarrow [0,1]\). We’ll initialize an estimate \(S\). For every item we see in the stream, we’ll set \[ S \gets \min(S, h(x_i)). \] Once we processed the stream, return \(\frac{1}{S} - 1\).

The figure below describes the algorithm. Elements arrive in a stream and we hash each item to the real number line between \(0\) and \(1\). By the definition of the hash function, each repeated item hashes to the same point. We maintain a value \(S\) hashed value we’ve seen so far.

Why do we return the estimate \(\hat{D} = \frac{1}{S}- 1\)? Intuitively, when \(D\) is larger \(S\), will be smaller because we get more chances to get small values. But the reason behind the exact estimate is that

**Lemma:** \(\mathbb{E}[S] =
\frac{1}{D+1}\).

We’ll see two proofs.

**Calculus Proof:** We’ll use the identity that \[
\mathbb{E}[X] = \int \Pr(X \geq x) dx
\] for a continuous and non-negative random variable \(X\). To see this, notice that \[
X = \int_{x=0}^X dx = \int_{x=0}^\infty \mathbb{1}[X \geq x] dx.
\] Taking expectation yields the identity. There are other ways
to prove this identity described in notes here.

Now let’s deploy the identity for the continuous random variable \(S\): \[ \mathbb{E}[S] = \int_{s=0}^1 \Pr(S \geq s) ds = \int_{s=0}^1 (1-s)^D d s \] \[ = \left. \frac{-(1-s)^{D+1}}{D+1} \right|_{s=0}^1 = \frac{1}{D+1}. \] In the second equality, we used the observation that the minimum \(S\) is greater than a value \(s\) if and only if all \(D\) hashed values are greater than \(s\).

We can understand every step of the calculus proof but it doesn’t
give us a deeper understanding for *why* the lemma is true. Let’s
turn to another proof “from the book”. This phrase comes from Paul
Erdős, a prolific Hungarian mathematician, who believed that there are
proofs so simple and elegant that they were written in a divine
book.

**Proof from the Book:** We’ll use the following
observation: For any event \(A\) and
random variable \(x\), \[\begin{align}
\mathbb{E}_x[\Pr[A|x]]
= \mathbb{E}_x[\mathbb{E}[\mathbb{1}[A] \mid x]]
= \mathbb{E}[\mathbb{1}[A]]
= \Pr[A].
\end{align}\] The first and last equality follow because \(\mathbb{1}[A]\) is a indicator random
variable for event \(A\). The middle
equality follows by the law of
total expectation. After a little bit of thought, the law of total
expectation is intuitive. But you can prove it to yourself by plugging
in the definition of expectations and exchanging sums/integrals, or you
can find the proof online here.

Back to our proof, we know that \(h_1,\ldots,h_D\) are i.i.d. (independent and identically distributed) uniform samples drawn from the interval \([0,1]\). Recall that \(S = \min_{i\in[D]} h_i\). Since all the \(h_i\) are drawn from the interval between \(0\) and \(1\), we can interpret \(S\) as the probability that the next hashed value is less than the minimum of the first \(D\) hashed values. Mathematically, \[\begin{align} S = \Pr[h_{D+1} \leq \min_{i\in[D]} h_i \mid h_1,\ldots,h_D] \end{align}\] Now we can compute the expectation of \(S\): \[\begin{align*} \mathbb{E}_{h_1,\ldots,h_D}[S] &= \mathbb{E}_{h_1,\ldots,h_D}[\Pr[h_{D+1} \leq \min_{i\in[D]} h_i \mid h_1,\ldots,h_D]]\\ &= \Pr_{h_1,\ldots,h_{D+1}}[h_{D+1} \leq \min_{i\in[D]} h_i] \\ &= \frac1{D+1}. \end{align*}\] The first equality follows by our alternative definition of \(S\), the second equality follows by the first observation in the proof, the final equality follows because each \(h_i\) is equally likely to be the minimum of all \(D+1\).

We now know the expectation of \(S\) (twice) but, in order to apply Chebyshev’s inequality, we need to also bound the variance. Recall that

\[ \textrm{Var}(S) = \mathbb{E}[S^2] - \mathbb{E}[S]^2. \]

We know \(\mathbb{E}[S]\) but we don’t yet know \(\mathbb{E}[S^2]\).

**Lemma:** \(\mathbb{E}[S^2] =
\frac{2}{(D+1)(D+2)}\).

**Calculus Proof:** Using a calculus proof similar to
the one for the expectation, we know \[
\mathbb{E}[S^2] = \int_{s=0}^1 \Pr(S^2 \geq s) ds
= \int_{s=0}^1 \Pr(S \geq \sqrt{s}) ds
= \int_{s=0}^1 (1-\sqrt{s})^D d s.
\] Using the WolframAlpha query here,
we can find that the last equality is \[
\frac{2}{(D+1)(D+2)}.
\]

Again, the calculus proof is correct but it doesn’t give us a deeper understanding.

**Proof from the Book:** We can apply the same machinery
that we used in the prior proof from the book. The key observation now
is that \(S^2\) is the probability that
the next *two* hashed values are both less than the minimum of
the first \(D\) hashed values.
Formally, \[
S^2 = \Pr(\max(h_{D+1}, h_{D+2}) \leq \min(h_1, \ldots, h_D) | h_1,
\ldots, h_D).
\] What’s the probability that both of the next hashed values are
less the minimum of the first \(D\)
hashed values? Well, that’s the probability that *either* the
first hashed value is less than the minimum of the first \(D+1\) hashed values and the second hashed
value is less than all \(D+2\) hashed
values *or* the second hashed value is less than the minimum of
the first \(D+1\) hashed values and the
first hashed value is less than all \(D+2\) hashed values. The probability of the
first event is \(\frac{1}{(D+1)(D+2)}\)
and the probability of the second event is \(\frac{1}{(D+1)(D+2)}\).

Using the same analysis from the prior proof from the book, we know that \[\begin{align} \mathbb{E}_{h_1, \ldots, h_D}[S^2] &= \Pr_{h_1, \ldots, h_{D+2}}(\max(h_{D+1}, h_{D+2}) \leq \min(h_1, \ldots, h_D)) \\ &= \frac{2}{(D+1)(D+2)}. \end{align}\]

Remember our goal is to compute the variance. With our expressions for \(\mathbb{E}[S]\) and \(\mathbb{E}[S^2]\), we know \[ \textrm{Var}(S) = \frac{2}{(D+1)(D+2)} - \frac{1}{(D+1)^2} \leq \frac{1}{(D+1)^2}. \]

Let’s try applying Chebyshev’s inequality. We know \(\mathbb{E}[S] = \frac{1}{D+1} = \mu\) and \(\textrm{Var}(S) \leq \frac{1}{(D+1)^2} = \mu^2\). The bound we want is \(\Pr(|S - \mu | \geq \epsilon \mu) \leq \delta\) where \(\delta\) is a (small) probability of failure. Instead, Chebyshev’s inequality gives \[ \Pr(|S - \mu| \geq \epsilon \mu) = \Pr( |S - \mu| \geq \epsilon \sigma ) \leq \frac{1}{\epsilon^2}. \] But \(\epsilon\) is a number less than \(1\) so the bound is vacuous!!

Just as we repeated the core subroutine in the count-min algorithm, we’ll repeat core subroutine in this algorithm. The new algorithm is to choose \(k\) random hash functions \(h_1, \ldots, h_k: \mathcal{U} \rightarrow [0, 1]\). We’ll keep \(k\) independent sketches of the minimum value \(S_1, \ldots, S_k\) that are all initialized to \(1\). Then, when we see item \(x_i\), we update \[S_j \gets \min(S_j, h_j(x_i))\] for every \(j \in \{1, \ldots, k\}\). At the end of the stream, we take the average \(S = (S_1 + \ldots + S_k)/k\) and return the estimate \(\hat{D} = \frac{1}{S} - 1\).

Our new algorithm is an example of a general strategy for variance reduction. We repeat many independent trials and take the mean. Given i.i.d. (independent, identically distributed) random variables, \(X_1, \ldots, X_k\) with mean \(\mu\) and variance \(\sigma^2\), we know that

\[ \mathbb{E} \left[ \frac{1}{k} \sum_{i=1}^k X_i \right] = \frac{1}{k} \sum_{i=1}^k \mathbb{E} \left[ X_i \right] = \mu \] and

\[ \textrm{Var}\left[ \frac{1}{k} \sum_{i}^k X_i\right] = \frac{1}{k^2}\sum_{i}^k \textrm{Var} \left[X_i\right] = \frac{\sigma^2}{k} \] where we used linearity of expectation and linearity of variance (since each \(X_i\) is independent).

Applying the variance reduction analysis to our algorithm, we have that \(\mathbb{E}[S] = \mu\) as before but now \(\textrm{Var}(S) \leq \frac{\mu^2}{k}\). Then Chebyshev’s inequality gives \[ \Pr\left(|S - \mu| \geq c \frac{\mu}{\sqrt{k}} \right) \leq \frac{1}{c^2}. \] Setting \(c = \frac{1}{\sqrt{\delta}}\) and \(k = \frac{1}{\epsilon^2 \delta}\), we have \[ \Pr(|S - \mu| \geq \epsilon \mu) \leq \delta. \]

We’re *nearly* done except that we have only bounded \(S\) around its expectation. We really need
to bound \(\hat{D}\) around its
expectation. We know that with probability \(\delta\), \[
(1-\epsilon) \mathbb{E}[S]
\leq S \leq (1+\epsilon) \mathbb{E}[S].
\] Inverting gives, \[
\frac{1}{(1+\epsilon) \mathbb{E}[S]} \leq \frac{1}{S} \leq
\frac{1}{(1-\epsilon) \mathbb{E}[S]}.
\] We can use the fact (easily verified on a graphing calculator
like Desmos)
that \(\frac{1}{1+\epsilon} \leq 1 -
2\epsilon\) and \(\frac{1}{1-\epsilon}
\geq 1 + 2\epsilon\). We subtract 1 from each term to get \[
(1-2\epsilon) \frac{1}{\mathbb{E}[S]} - 1
\leq \frac{1}{S} - 1 \leq (1+2\epsilon)
\frac{1}{\mathbb{E}[S]} - 1.
\] Now, since \(\frac{1}{\mathbb{E}[S]}
\leq 1\), we know that \(\frac{1}{\mathbb{E}[S]} \leq 2 \left(
\frac{1}{\mathbb{E}[S]} - 1 \right)\) and so \(2\epsilon \frac{1}{\mathbb{E}[S]} \leq 4 \epsilon
\left( \frac{1}{\mathbb{E}[S]} - 1 \right)\). Then \[
(1-4 \epsilon)\left( \frac{1}{\mathbb{E}[S] - 1} \right)
\leq \frac{1}{S} - 1
\leq (1+4\epsilon) \left( \frac{1}{\mathbb{E}[S]} - 1 \right).
\] Since \(D = \frac{1}{\mathbb{E}[S]}
- 1\) and \(\hat{D} = \frac{1}{S} -
1\), we have that with probability \(\delta\), \[
(1-4\epsilon) D \leq \hat{D} \leq (1+4\epsilon) D.
\] We lose a small factor on the \(\epsilon\) but this goes into the big-O
notation. So our final space complexity is \(O(k \log D) = O\left(\frac{\log D}{\epsilon^2
\delta}\right)\). The \(\log D\)
factor comes from the fact that each of the \(k\) hash functions needs \(\log D\) bits to represent the input.
Impressively, the space complexity has no *linear* dependence on
the number of distinct elements \(D\).
We know that the \(\frac{1}{\epsilon^2}\) dependence cannot be
improved but we can get a better bound depending on \(\log(1/\delta)\) instead of \(1/\delta\) by using a stronger
concentration inequality.

For our analysis, we assumed that we could hash to real numbers on \([0,1]\). In practice, the hash function maps to bit vectors and the estimate \(S\) is the number of leading zeros in the bit vector. Intuitively, the more distinct hashes we see, the higher we expect the maximum number of leading zeros to be.

While the space complexity is very small, the true benefit of the
algorithm is that it can be implemented in a *distributed*
setting. The estimates \(S_i\) from
each machine can be sent to a central location and the final \(S = \min_i S_i\) can be computed there. Use
cases of the algorithm include counting the number of distinct users in
New York City that made at least one search containing the word ‘car’ in
the last month or counting the number of distinct subject lines in
emails sent by users that have registered in the last week (to detect
spam). Answering such a query requires a distributed linear scan over
the database that can be done in less than two seconds using
Google’s implementation.

Suppose Google answers map search queries using \(n\) servers \(A_1, \ldots, A_n\). Given a query like “New York to Rhode Island”, a common practice is to choose a random hash function \(h: \mathcal{U} \rightarrow \{1, \ldots, q\}\) and to route this query to the server corresponding to its hashed value.

Our goal is to ensure that requests are distributed evenly, so no one server gets loaded with too many requests. We want to avoid downtime and slow responses to clients.

Why should we use a hash function instead of just distributing requests randomly? Well, if a query has already been answered on a server, we want to direct the same query to it again so we don’t have to recompute the same answer on different servers.

Suppose we have \(n\) servers and \(m\) requests \(x_1, \ldots, x_m\). Let \(s_i\) be the number of requests sent to server \(i \in \{1, \ldots, n\}\). Formally, \[ s_i = \sum_{j=1}^m \mathbb{1}[h(x_j)=1]. \] Our goal is to understand the value of the maximum load on any server, which can be written as the random variable \[ S = \max_{i \in \{1, \ldots, n\}} s_i. \]

A good first step in any analysis of random variables is to think about expectations. If we have \(n\) servers and \(m\) requests, for any \(i \in \{1, \ldots, n\}\), \[ \mathbb{E}[s_i] = \sum_{j=1}^m \mathbb{E}\left[\mathbb{1}[h(x_j)=1]\right] = \frac{m}{n}. \] But it’s very unclear what the expectation of \(S\) is. In particular, \(\mathbb{E}[S] \neq \max_{i \in \{1, \ldots, n\}} \mathbb{E}[s_i]\).

Can you convince yourself that for two random variables \(A\) and \(B\), \(\mathbb{E}[\max(A,B)] \neq \max(\mathbb{E}[A], \mathbb{E}[B])\)? One example to think about is when \(A\) and \(B\) are independent variables that are each equally likely to be \(1\) and \(0\).

In order to reduce notation and keep the math simple, let’s assume that \(m=n\). That is, we assume that we have exactly the same number of servers and requests. As we did in the last problem, we’ll continue to assume we have access to a uniformly random hash function \(h\) so that \(\Pr(h(x) = h(y))=\frac{1}{m}\).

We can visualize the load balancing problem in the following figure: Each request is placed in a server uniformly and at random.

We know that the expected number of requests per server is \(\mathbb{E}[s_i] = \frac{m}{n} = 1\) under our assumption that we have the same number of requests and servers. We would like to prove \[ \Pr\left( \max_{i} s_i \geq C \right) \leq \frac{1}{10} \] where \(C\) is a small value (much smaller than \(n\)). Putting it another way, we would like to prove \[ \Pr\left( (s_1 \geq C) \cup (s_2 \geq C) \cup \ldots \cup (s_n \geq C)\right) \leq \frac{1}{10}. \] Notice these statements are equivalent since the max is greater than \(C\) if at least one of the values is greater than \(C\).

Whenever we look at the union of different events, we should immediately think of the union bound! Recall that the union bound allows us to nicely upper bound the union of events even when the events have complicated and dependent dynamics.

**Union Bound:** For any events \(A_1, \ldots, A_n\), \[
\Pr(A_1 \cup \ldots \cup A_n) \leq \Pr(A_1) + \ldots + \Pr(A_n).
\]