Powered by repeated innovations in chip manufacturing, computers have grown exponentially more powerful over the last several decades. As a result, we have access to unparalleled computational resources and data. Every single day, a single NASA satellite collects 20 terabytes of satellite images, more than 8 billion searches are made on Google, and estimates suggest the internet creates more than 300 million terabytes of data. Simultaneously, we are quickly approaching the physical limit of how many transistors can be packed onto a single chip. In order to learn from the data we have and continue expanding our computational abilities into the future, fast and efficient algorithms are more important than ever.
At first glance, an algorithm that performs only a few operations per item in our data set is efficient. However, these algorithms can be too slow when we have lots and lots of data. Instead, we turn to randomized algorithms that can run even faster. Randomized algorithms typically exploit some source of randomness to run on only a small part of the data set (or use only a small amount of space) while still returning an approximately correct result.
We can run randomized algorithms in practice to see how well they work. But we also want to prove that they work and understand why. Today, we will solve two problems using randomized algorithms. Before we get to the problems and algorithms, we’ll build some helpful probability tools.
Consider a random variable $X$. The expected value $\mathbb{E}[X] = \sum_{x} x \Pr[X = x]$ tells us where the random variable is on average, but we're also interested in how closely the random variable concentrates around its expectation. The variance of a random variable is $\mathrm{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big]$.
There are a number of useful facts about the expected value and variance. For example, expanding the square in the definition of the variance gives $\mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$.
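As a quick, informal check of these definitions (this example is not from the notes), the following Python snippet takes $X$ to be a fair six-sided die roll and compares the empirical mean and variance of many samples against the exact values computed from the formulas above.

```python
import random

# Illustrative example (not from the notes): X is a fair six-sided die roll.
outcomes = [1, 2, 3, 4, 5, 6]

# Exact values from the definitions above.
exact_mean = sum(x / 6 for x in outcomes)                     # E[X] = 3.5
exact_var = sum((x - exact_mean) ** 2 / 6 for x in outcomes)  # Var(X) = E[(X - E[X])^2]

# Empirical estimates from repeated sampling.
samples = [random.choice(outcomes) for _ in range(100_000)]
emp_mean = sum(samples) / len(samples)
emp_var = sum((x - emp_mean) ** 2 for x in samples) / len(samples)

print(f"exact mean {exact_mean:.3f}, empirical mean {emp_mean:.3f}")
print(f"exact variance {exact_var:.3f}, empirical variance {emp_var:.3f}")
```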
Once we have defined random variables, we are often interested in
events defined on their outcomes. Let
If information about event
Let’s figure out whether the event
We’ve been talking about events defined on random variables, but
we’ll also be interested in when random variables are independent.
Consider random variables
One of the most powerful theorems in all of probability is the linearity of expectation.
Theorem: Let $X_1, X_2, \ldots, X_n$ be random variables. Then $\mathbb{E}[X_1 + X_2 + \cdots + X_n] = \mathbb{E}[X_1] + \mathbb{E}[X_2] + \cdots + \mathbb{E}[X_n]$. Notably, the theorem holds even when the random variables are not independent.
Proof: Observe that, for two random variables $X$ and $Y$, $\mathbb{E}[X + Y] = \sum_{x, y} (x + y) \Pr[X = x, Y = y] = \sum_{x} x \Pr[X = x] + \sum_{y} y \Pr[Y = y] = \mathbb{E}[X] + \mathbb{E}[Y]$, where the middle equality follows by summing out the other variable in each term. Applying this argument repeatedly gives the statement for $n$ random variables.
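To see why the lack of an independence requirement matters, here is a small simulation (illustrative, not from the notes) with two heavily dependent random variables: $X$ is a fair die roll and $Y = X^2$ is completely determined by $X$, yet the empirical mean of $X + Y$ still matches the sum of the individual means.

```python
import random

# Illustrative example: X is a fair die roll and Y = X^2 is completely
# determined by X, so X and Y are far from independent.
trials = 100_000
sum_x = sum_y = sum_xy = 0.0
for _ in range(trials):
    x = random.randint(1, 6)
    y = x * x
    sum_x += x
    sum_y += y
    sum_xy += x + y

# Linearity of expectation: E[X + Y] = E[X] + E[Y], no independence needed.
print(f"E[X] + E[Y] is approximately {sum_x / trials + sum_y / trials:.3f}")
print(f"E[X + Y]    is approximately {sum_xy / trials:.3f}")
```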
There are also several other useful facts about the expected value and variance.
Fact 1: When
Proof: Observe that
Fact 2: Consider a random variable
Proof: Observe that
Fact 3: When
Proof: Observe that
We’ll pose a problem that has applications in ecology, social networks, and internet indexing. However, while efficiently solving the problem is useful, our purpose is really to gain familiarity with linearity of expectation and learn Markov’s inequality.
Suppose you run a website that is considering contracting with a
company to provide CAPTCHAs for login verification. The company claims
to have a database with $n$ unique CAPTCHAs (say, $n$ is one million) and offers an API that returns a CAPTCHA drawn uniformly at random from the database. Before paying for the service, we would like to verify that the database really is as large as claimed.
An obvious approach is to keep calling their API until we have seen a million unique CAPTCHAs. Of course, the issue is that we have to make at least a million API calls. That's not so good if we care about efficiency, if they charge us per call, or if the database size they claim is much bigger than a million.
A more clever approach is to call their API and count duplicates.
Intuitively, the larger their database, the fewer duplicates we expect
to see. Define a random variable $D_{i,j}$ that is $1$ if the $i$th and $j$th API calls return the same CAPTCHA and $0$ otherwise.
When a random variable can only be 0 or 1, we call it an
indicator random variable. Indicator random variables have the
special property that their expected value is the probability they are
1. We can define the total number of duplicates among $m$ API calls as $D = \sum_{1 \leq i < j \leq m} D_{i,j}$.
We can calculate the expected number of duplicates using linearity of expectation.
Since each API call returns a CAPTCHA chosen uniformly at random from a database of size $n$, any two calls return the same CAPTCHA with probability $1/n$, so $\mathbb{E}[D_{i,j}] = \Pr[D_{i,j} = 1] = 1/n$.
Suppose we take $m$ samples from the API. What is the expected number of duplicates? Well, the expectation would be $\mathbb{E}[D] = \sum_{i < j} \mathbb{E}[D_{i,j}] = \binom{m}{2} \cdot \frac{1}{n}$ by linearity of expectation, so the smaller the database, the more duplicates we expect to see.
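To make the calculation concrete, here is a small simulation (a sketch with hypothetical parameters, not code from the notes) that draws $m$ CAPTCHAs uniformly at random from a database of size $n$, counts duplicate pairs, and compares the count to the expectation $\binom{m}{2}/n$ derived above.

```python
import random
from itertools import combinations
from math import comb

# Hypothetical parameters for illustration only.
n = 1_000_000  # true database size
m = 3_000      # number of API calls we simulate

# Simulate the API: each call returns a uniformly random CAPTCHA id.
calls = [random.randrange(n) for _ in range(m)]

# D = sum over pairs i < j of the indicator that calls i and j match.
duplicates = sum(1 for a, b in combinations(calls, 2) if a == b)

expected = comb(m, 2) / n  # E[D] = (m choose 2) / n, about 4.5 here
print(f"observed duplicates: {duplicates}, expected: {expected:.2f}")
```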
Concentration inequalities are a powerful tool in the analysis of randomized algorithms. They tell us how likely it is that a random variable differs from its expectation.
There are many concentration inequalities. Some apply in general and some apply only under special assumptions. The concentration inequalities that apply only under special assumptions tend to give stronger results. We'll start with one of the simplest and most general concentration inequalities.
Theorem (Markov's Inequality): For any non-negative random variable $X$ and any $t > 0$, we have $\Pr[X \geq t] \leq \frac{\mathbb{E}[X]}{t}$.
Proof: We’ll prove the inequality directly. By the definition of expectation, we have $\mathbb{E}[X] = \sum_{x} x \Pr[X = x] \geq \sum_{x \geq t} x \Pr[X = x] \geq t \sum_{x \geq t} \Pr[X = x] = t \Pr[X \geq t]$, where the first inequality uses that $X$ is non-negative, so dropping the terms with $x < t$ can only decrease the sum. Dividing both sides by $t$ gives the claim.
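As an informal sanity check (not part of the notes), the snippet below samples a non-negative random variable, the sum of two fair die rolls, and compares the empirical probability $\Pr[X \geq t]$ against the Markov bound $\mathbb{E}[X]/t$ for a few thresholds.

```python
import random

# Illustrative non-negative random variable: X is the sum of two fair die rolls.
trials = 200_000
samples = [random.randint(1, 6) + random.randint(1, 6) for _ in range(trials)]
mean = sum(samples) / trials  # approximately E[X] = 7

for t in (8, 10, 12):
    empirical = sum(1 for x in samples if x >= t) / trials
    bound = mean / t          # Markov: Pr[X >= t] <= E[X] / t
    print(f"t={t}: Pr[X >= t] is about {empirical:.3f}, Markov bound {bound:.3f}")
```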
Now let’s apply Markov’s inequality to our set size estimation
problem. Since the number of duplicates $D$ is a non-negative random variable, Markov's inequality tells us that $\Pr[D \geq t] \leq \frac{\mathbb{E}[D]}{t} = \frac{\binom{m}{2}}{nt}$ for any $t > 0$. So if the company's claim about $n$ were true, observing far more duplicates than $\binom{m}{2}/n$ would be unlikely, and a large duplicate count gives us good reason to doubt the claim.
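One way to turn this into a concrete check, sketched below under the setup above and with hypothetical numbers, is to compute the Markov bound on the probability of seeing at least the observed number of duplicates assuming the claimed size is correct; a tiny bound makes the claim hard to believe.

```python
from math import comb

def markov_bound_on_duplicates(claimed_n: int, m: int, observed_d: int) -> float:
    """Upper bound on Pr[D >= observed_d] if the database really had claimed_n CAPTCHAs."""
    expected_d = comb(m, 2) / claimed_n  # E[D] under the company's claim
    return min(1.0, expected_d / observed_d)

# Hypothetical numbers: the company claims one million CAPTCHAs,
# we make 3,000 calls and observe 50 duplicate pairs.
bound = markov_bound_on_duplicates(claimed_n=1_000_000, m=3_000, observed_d=50)
print(f"{bound:.3f}")  # about 0.09: if the claim were true, seeing this many
                       # duplicates would happen less than ~9% of the time
```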
In practice, many set size estimation problems are slightly different. Instead of checking a claim about the set size, we want to estimate the set size directly. Notice that we computed $\mathbb{E}[D] = \binom{m}{2}/n$; rearranging suggests the natural estimate $\hat{n} = \binom{m}{2}/D$, where $D$ is the number of duplicates we actually observe.
Claim: If we make
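Here is a minimal sketch of the plug-in estimator $\hat{n} = \binom{m}{2}/D$ described above (illustrative code, not from the notes); note that it needs enough samples for at least one duplicate to appear.

```python
import random
from itertools import combinations
from math import comb

def estimate_set_size(samples: list) -> float:
    """Plug-in estimate n_hat = (m choose 2) / D, where D counts duplicate pairs."""
    m = len(samples)
    duplicates = sum(1 for a, b in combinations(samples, 2) if a == b)
    if duplicates == 0:
        raise ValueError("no duplicates observed; take more samples")
    return comb(m, 2) / duplicates

# Illustrative use with a hypothetical true size of 100,000.
true_n = 100_000
samples = [random.randrange(true_n) for _ in range(2_000)]
print(f"true size {true_n}, estimate {estimate_set_size(samples):.0f}")
```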
The frequent items problem is to identify the items that appear most often in a stream. For example, we may want to find the most popular products on Amazon, the most watched videos on YouTube, or the most searched queries on Google. We process each item as it appears in the stream and, at any moment, our goal is to return the most frequent items without having to scan through a database.
The obvious algorithm for the frequent items problem is to store each item and the number of times we’ve seen it. The issue is that we would then need space linear in the number of unique items. If the items are pairs of products, for example, then the space scales quadratically. If the items are triplets of videos, then the space scales cubically. Clearly, we need a more efficient algorithm. It turns out that we can’t solve the problem exactly with less space but we can solve the problem approximately.
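For contrast with the sketching algorithm that follows, here is the obvious exact algorithm as a short sketch (illustrative code, not from the notes): a dictionary mapping each item to its count, which uses space linear in the number of distinct items.

```python
from collections import Counter

def exact_frequent_items(stream, k):
    """Return the k most frequent items, using one counter per distinct item."""
    counts = Counter()
    for item in stream:
        counts[item] += 1  # space grows linearly with the number of distinct items
    return counts.most_common(k)

# Illustrative stream of product views.
stream = ["fish tank", "basketball", "fish tank", "lamp", "fish tank", "basketball"]
print(exact_frequent_items(stream, k=2))  # [('fish tank', 3), ('basketball', 2)]
```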
Consider a stream of $n$ items. Since we cannot solve the problem exactly in small space, we relax the goal: we want to return every item that appears at least $\varepsilon n$ times in the stream, and to return only items that appear nearly that often (we may also return some items whose counts fall a bit short of the threshold, but nothing far below it).
We’ll see how to use a randomized hashing algorithm to solve the
problem. The algorithm addresses the slightly different problem of
estimating the frequency of any item $x$ in the stream, where the frequency $f_x$ is the number of times $x$ appears. Our goal is to return an estimate $\hat{f}_x$ that is close to the true frequency $f_x$.
The key ingredient of the algorithm is a hash function. Let the hash function $h$ map each item in the stream to one of $k$ values $\{1, 2, \ldots, k\}$.
Definition: A hash function $h$ is uniformly random if, for every item $x$, the value $h(x)$ is uniformly distributed over $\{1, \ldots, k\}$, and the values $h(x)$ for distinct items $x$ are mutually independent. Notice that the independence condition implies that any two distinct items $x \neq y$ collide with probability $\Pr[h(x) = h(y)] = 1/k$.
In general, it is not possible to efficiently implement uniform
random hash functions. But, for our application, we only need a
universal hash function which can be implemented efficiently.
Let $\mathcal{H}$ be a family of hash functions mapping items to $\{1, \ldots, k\}$. We say $\mathcal{H}$ is universal if, when $h$ is chosen uniformly at random from $\mathcal{H}$, any two distinct items $x \neq y$ collide with probability $\Pr[h(x) = h(y)] \leq 1/k$.
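One standard construction of a universal hash family (possibly different from the one used in these notes) is the Carter–Wegman scheme $h_{a,b}(x) = ((a x + b) \bmod p) \bmod k$ for a prime $p$ larger than the universe of items; a minimal sketch in Python follows.

```python
import random

class UniversalHash:
    """Carter-Wegman universal hash: h(x) = ((a*x + b) mod p) mod k.

    For distinct integers x, y < p, the collision probability over the random
    choice of (a, b) is at most 1/k. Non-integer items can be mapped to
    integers first (e.g., by a fixed encoding).
    """

    def __init__(self, k, p=(1 << 61) - 1):
        self.k = k
        self.p = p                       # a large prime, bigger than the universe
        self.a = random.randrange(1, p)  # random nonzero multiplier
        self.b = random.randrange(0, p)  # random offset

    def __call__(self, x):
        return ((self.a * x + self.b) % self.p) % self.k

h = UniversalHash(k=10)
print(h(42), h(42), h(43))  # the same item always hashes to the same cell
```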
With hash functions in hand, we’re now ready to describe the algorithm.
We first choose a random hash function $h$ from a universal family, mapping items to $\{1, \ldots, k\}$, and maintain an array $A$ of $k$ counters, all initialized to zero. Every time an item $x$ arrives in the stream, we increment the counter $A[h(x)]$.
In the figure, Amazon products appear one after the other in a stream. We hash the fish tank to the second cell in the array, the basketball to the first cell, and so on. Crucially, when we see the fish tank again, we hash it to the same cell.
After processing the stream, our estimate for the frequency of item $x$ is the value of the counter in the cell that $x$ hashes to. Formally, our estimate for the frequency is $\hat{f}_x = A[h(x)] = f_x + \sum_{y \neq x \,:\, h(y) = h(x)} f_y$: the true frequency $f_x$ plus an error term contributed by every other item that collides with $x$.
Let’s take the expectation of the error term: by linearity of expectation and the collision probability of a universal hash function, $\mathbb{E}\Big[\sum_{y \neq x \,:\, h(y) = h(x)} f_y\Big] = \sum_{y \neq x} f_y \Pr[h(y) = h(x)] \leq \frac{1}{k} \sum_{y \neq x} f_y \leq \frac{n}{k}$, where $n$ is the length of the stream.
We have a non-negative random variable and a bound on its expectation, so let’s apply Markov’s inequality to the error term: the error exceeds $2n/k$ with probability at most $\frac{n/k}{2n/k} = \frac{1}{2}$.
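Here is a minimal sketch of this single-array version (illustrative code, not from the notes), using Python's built-in hashing keyed by a random seed as a stand-in for a randomly chosen universal hash function.

```python
import random

class SingleArraySketch:
    """One array of k counters with a single random hash function.

    The estimate A[h(x)] equals the true frequency of x plus the frequencies
    of all colliding items, so it never underestimates.
    """

    def __init__(self, k):
        self.k = k
        self.counters = [0] * k
        self.seed = random.getrandbits(64)  # stand-in for choosing a random hash function

    def _hash(self, item):
        return hash((self.seed, item)) % self.k

    def update(self, item):
        self.counters[self._hash(item)] += 1

    def estimate(self, item):
        return self.counters[self._hash(item)]

sketch = SingleArraySketch(k=100)
for item in ["fish tank", "basketball", "fish tank"]:
    sketch.update(item)
print(sketch.estimate("fish tank"))  # at least 2, possibly more due to collisions
```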
A common approach in many randomized algorithms is to boost the
success probability by repeating the core subroutine. In our case, we’ll
maintain $t$ arrays $A_1, \ldots, A_t$, each with $k$ counters and its own independently chosen hash function $h_1, \ldots, h_t$.
As depicted in the figure, each item gets hashed into one cell in
every array using the respective hash function. So the update on item $x$ increments the counter $A_i[h_i(x)]$ in each array $i = 1, \ldots, t$.
Then, when we’re computing our estimate for the frequency of an item, we look at each cell it appears in and take the minimum. Formally, $\hat{f}_x = \min_{i \in \{1, \ldots, t\}} A_i[h_i(x)]$. For every array $i$, the counter $A_i[h_i(x)]$ is at least the true frequency $f_x$, and by the single-array analysis it exceeds $f_x$ by more than $2n/k$ with probability at most $1/2$. Because the hash functions are chosen independently, the minimum exceeds $f_x + 2n/k$ only if every one of the $t$ arrays does, which happens with probability at most $2^{-t}$.
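Putting the pieces together, here is a compact count-min sketch (an illustrative sketch of the data structure described above, again using seeded built-in hashing rather than an explicit universal family).

```python
import random

class CountMinSketch:
    """t arrays of k counters; the estimate is the minimum counter an item hashes to."""

    def __init__(self, k, t):
        self.k = k
        self.t = t
        self.tables = [[0] * k for _ in range(t)]
        # One independent seed per array, standing in for t independent hash functions.
        self.seeds = [random.getrandbits(64) for _ in range(t)]

    def _hash(self, i, item):
        return hash((self.seeds[i], item)) % self.k

    def update(self, item):
        # Increment the counter the item hashes to in every array.
        for i in range(self.t):
            self.tables[i][self._hash(i, item)] += 1

    def estimate(self, item):
        # Every counter overestimates the true frequency, so take the minimum.
        return min(self.tables[i][self._hash(i, item)] for i in range(self.t))

cms = CountMinSketch(k=200, t=5)
for item in ["fish tank", "basketball", "fish tank", "lamp", "fish tank"]:
    cms.update(item)
print(cms.estimate("fish tank"))  # 3, barring unlucky collisions in every array
```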
Putting it all together, we just proved that count-min sketch lets us
estimate the frequency of each item in the stream up to additive error $2n/k$, with probability at least $1 - 2^{-t}$. However, this guarantee is only for a single item $x$; we would like the estimates to be accurate for every item in the stream simultaneously.
Another simple and powerful tool in randomized algorithm design is the union bound. The union bound allows us to easily analyze the complicated dynamics of many random events.
Lemma (Union Bound): For any random events $A_1, A_2, \ldots, A_n$, we have $\Pr[A_1 \cup A_2 \cup \cdots \cup A_n] \leq \Pr[A_1] + \Pr[A_2] + \cdots + \Pr[A_n]$.
Proof: We’ll give a “proof by picture”. Each circle
represents the set of outcomes corresponding to one of the events. The probability of the union is the probability of the region covered by all of the circles together; summing the individual probabilities counts every overlapping region more than once, so the sum can only be larger.
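As a quick illustration (not from the notes), the snippet below estimates the probability of a union of three overlapping events on a die roll by simulation and compares it to the sum of the individual probabilities.

```python
import random

# Illustrative overlapping events on a fair die roll X:
# A1 = {X is even}, A2 = {X >= 4}, A3 = {X = 6}.
trials = 100_000
union_count = 0
individual = [0, 0, 0]
for _ in range(trials):
    x = random.randint(1, 6)
    events = [x % 2 == 0, x >= 4, x == 6]
    union_count += any(events)
    for i, happened in enumerate(events):
        individual[i] += happened

union_prob = union_count / trials                   # Pr[A1 or A2 or A3], exactly 2/3 here
sum_of_probs = sum(c / trials for c in individual)  # Pr[A1] + Pr[A2] + Pr[A3] = 7/6
print(f"Pr[union] is about {union_prob:.3f}, sum of probabilities about {sum_of_probs:.3f}")
```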
Let’s apply the union bound to the total failure probability of our
estimate. The algorithm fails if the estimate for some item in the stream is off by more than $2n/k$. Each individual estimate fails with probability at most $2^{-t}$, and there are at most $n$ distinct items in the stream, so by the union bound the probability that any estimate fails is at most $n \cdot 2^{-t}$.
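Under the bounds as reconstructed above (treat the exact constants as an assumption rather than the notes' final statement), a small helper can choose the sketch dimensions: $k$ controls the additive error and $t$ controls the total failure probability via the union bound.

```python
from math import ceil, log2

def count_min_parameters(n, rel_error, delta):
    """Pick (k, t) so every estimate is within rel_error * n of the truth,
    except with total probability at most delta.

    Uses the bounds sketched above: per-array error 2n/k with probability
    at most 1/2, and total failure probability n * 2^(-t) by the union bound.
    """
    k = ceil(2 / rel_error)    # ensures 2n/k <= rel_error * n
    t = ceil(log2(n / delta))  # ensures n * 2^(-t) <= delta
    return k, t

# Example: a stream of a million items, 1% additive error, 0.1% failure probability.
print(count_min_parameters(n=1_000_000, rel_error=0.01, delta=0.001))  # (200, 30)
```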