Computational Geometry
Sariel Har-Peled
Department of Computer Science; University of Illinois; 201 N. Goodwin Avenue; Urbana, IL, 61801, USA;
[email protected]; http://www.uiuc.edu/~sariel/. Work on this paper was partially supported by an NSF
CAREER award CCR-0132901.
Contents
Contents 3
Preface 11
3 Well Separated Pairs Decomposition 39
3.1 Well-separated pairs decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1.1 The construction algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Applications of WSPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.1 Spanners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.2 Approximating the Minimum Spanning Tree . . . . . . . . . . . . . . . . . . . . . 44
3.2.3 Approximating the Diameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.4 Closest Pair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.5 All Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.5.1 The bounded spread case . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.5.2 All nearest neighbor - the unbounded spread case . . . . . . . . . . . . . 47
3.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6 Sampling and the Moments Technique 77
6.1 Vertical Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.1.1 Backward Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.2 General Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.3.1 Analyzing the RIC Algorithm for Vertical Decomposition . . . . . . . . . . . . . . 82
6.3.2 Cuttings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.5 Proof of Lemma 6.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
10.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
15 ANN in High Dimensions 143
15.1 ANN on the Hypercube . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
15.1.1 Hypercube and Hamming distance . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
15.1.2 Constructing NNbr for the Hamming cube . . . . . . . . . . . . . . . . . . . . . . . 143
15.1.3 Constructing the near-neighbor data-structure . . . . . . . . . . . . . . . . . . . . . 144
15.2 LSH and ANN in Euclidean Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
15.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
15.2.2 Locality Sensitive Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
15.2.3 ANN in High Dimensional Euclidean Space . . . . . . . . . . . . . . . . . . . . . . 146
15.2.3.1 Low quality HST in high dimensional Euclidean space . . . . . . . . . . . 146
15.2.3.2 The overall result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
15.3 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
19.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
23 Duality 183
23.1 Duality of lines and points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
23.1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
23.1.1.1 Segments and Wedges . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
23.1.1.2 Convex hull and upper/lower envelopes . . . . . . . . . . . . . . . . . . . 184
23.2 Higher Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
23.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
23.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
23.4.1 Projective geometry and duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
23.4.2 Duality, Voronoi Diagrams and Delaunay Triangulations. . . . . . . . . . . . . . . . 186
24 Finite Metric Spaces and Partitions 189
24.1 Finite Metric Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
24.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
24.2.1 Hierarchical Tree Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
24.2.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
24.3 Random Partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
24.3.1 Constructing the partition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
24.3.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
24.4 Probabilistic embedding into trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
24.4.1 Application: approximation algorithm for k-median clustering . . . . . . . . . . . . 192
24.5 Embedding any metric space into Euclidean space . . . . . . . . . . . . . . . . . . . . . . . 193
24.5.1 The bounded spread case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
24.5.2 The unbounded spread case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
24.6 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
24.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Bibliography 203
Index 210
Preface
This manuscript is a collection of class notes on geometric approximation algorithms. It represents the book
I wish I could have read when I started doing research.
There are without doubt errors and mistakes in the text and I would like to know about them. Please
email me about any of them you find.
On optimality. Since this text is intended to explain key ideas and algorithms, I have consistently avoided
the trap of optimality, when the optimal known algorithm is way more complicated and less insightful
than some other algorithm. A reference to the optimal known algorithm is usually given in the
bibliographical notes.
A note on style. Injected into the text are random comments (usually as footnotes) that have nothing
directly to do with the text. I hope these comments make the text more enjoyable to read, and I added them,
on the spur of the moment, to amuse myself. Some readers might find these comments irritating and vain,
and I humbly ask these readers to ignore them.
Acknowledgements
I had the benefit of interacting with numerous people on the work in this book, sometimes directly and
sometimes indirectly. There is something mundane and predictable in enumerating a long list of people that helped and
contributed to this work, but this in no way diminishes their contribution and their help.
As such, I would like to thank the students in the class for their input, which helped in discovering
numerous typos and errors in the manuscript. Furthermore, the content was greatly affected by numerous
insightful discussions with Jeff Erickson and Edgar Ramos. Other people that provided comments or
insights, or answered nagging emails from me, and to whom I am thankful, include Bernard Chazelle, John
Fischer, Samuel Hornus, Piotr Indyk, Mira Lee, Jirka Matoušek, and Manor Mendel.
I am sure that other people have contributed to this work whom I have forgotten to mention – they have
my thanks and apologies.
Copyright
This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 License. To view a copy
of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative
Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
— Sariel Har-Peled
April 2007, Daejeon, Korea
Chapter 1
The Peace of Oliva. How sweet and peaceful it sounds! There the great powers noticed for the first time that the
land of the Poles lends itself admirably to partition.
– The Tin Drum, Günter Grass
In this chapter, we are going to discuss two basic geometric algorithms. The first one computes the
closest pair among a set of n points in linear time. This is a beautiful and surprising result that exposes the
computational power of using grids for geometric computation. Next, we discuss a simple algorithm for
approximating the smallest enclosing ball that contains k points of the input. This at first looks like a bizarre
problem, but turns out to be a key ingredient to our later discussion.
1.1 Preliminaries
For a real positive number r and a point p = (x, y) in IR², define Gr(p) to be the grid point (⌊x/r⌋·r, ⌊y/r⌋·r).
We call r the width of the grid Gr. Observe that Gr partitions the plane into square regions, which we call
grid cells. Formally, for any i, j ∈ Z, the intersection of the half-planes x ≥ ri, x < r(i + 1), y ≥ rj and
y < r(j + 1) is said to be a grid cell. Further, we define a grid cluster as a block of 3 × 3 contiguous grid cells.
Note that every grid cell C of Gr has a unique ID; indeed, let p = (x, y) be any point in C, and consider
the pair of integer numbers idC = id(p) = (⌊x/r⌋, ⌊y/r⌋). Clearly, only points inside C are going to be
mapped to idC. This is very useful, since we can store a set P of points inside a grid efficiently. Indeed, given a
point p, compute its id(p). We associate with each unique id a data-structure that stores all the points falling
into this grid cell (of course, we do not maintain such data-structures for grid cells which are empty). So,
once we computed id(p), we fetch the data structure associated with this cell, by using hashing. Namely,
we store pointers to all those data-structures in a hash table, where each such data-structure is indexed by its
unique id. Since the ids are integer numbers, we can do the hashing in constant time.
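The following is a minimal Python sketch of this grid storage scheme; a Python dictionary stands in for the hash table, and the class and method names are illustrative only.

    import math
    from collections import defaultdict

    class Grid:
        # Points of P bucketed by the id of the grid cell (of width r) containing them.
        def __init__(self, r):
            self.r = r
            self.cells = defaultdict(list)    # cell id -> list of points in that cell

        def cell_id(self, p):
            # id(p) = (floor(x/r), floor(y/r)), the unique id of the cell containing p
            return (math.floor(p[0] / self.r), math.floor(p[1] / self.r))

        def insert(self, p):
            self.cells[self.cell_id(p)].append(p)

        def cluster(self, p):
            # all stored points in the 3x3 block of cells centered at the cell of p
            i, j = self.cell_id(p)
            return [q for di in (-1, 0, 1) for dj in (-1, 0, 1)
                    for q in self.cells.get((i + di, j + dj), [])]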
Assumption 1.1.1 Throughout the discourse, we assume that every hashing operation takes (worst case)
constant time. This is quite a reasonable assumption when true randomness is available (using for example
perfect hashing [CLRS01]).
For a point set P, and parameter r, the partition of P into subsets by the grid Gr , is denoted by Gr (P).
More formally, two points p, q ∈ P belong to the same set in the partition Gr (P), if both points are being
mapped to the same grid point or equivalently belong to the same grid cell.
1.2 Closest Pair
We are interested in solving the following problem:
Problem 1.2.1 Given a set P of n points in the plane, find the pair of points closest to each other. Formally,
return the pair of points realizing CP(P) = min_{p,q∈P, p≠q} ‖p − q‖.
Lemma 1.2.2 Given a set P of n points in the plane, and a distance r, one can verify in linear time, whether
CP(P) < r, CP(P) = r, or CP(P) > r.
Proof: Indeed, store the points of P in the grid Gr . For every non-empty grid cell, we maintain a linked
list of the points inside it. Thus, adding a new point p takes constant time. Indeed, compute id(p), check if
id(p) already appears in the hash table, if not, create a new linked list for the cell with this ID number, and
store p in it. If a data-structure already exists for id(p), just add p to it.
This takes O(n) time. Now, if any grid cell in Gr(P) contains more than, say, 9 points of P, then it must
be that CP(P) < r. Indeed, consider a cell C containing more than nine points of P, and partition C into
3 × 3 equal squares. Clearly, one of those squares must contain two points of P; let C′ be this square.
Clearly, diam(C′) = diam(C)/3 = √(r² + r²)/3 < r. Thus, the two (or more) points of P in C′ are at
distance smaller than r from each other, and CP(P) < r.
Theorem 1.2.3 For set P of n points in the plane, one can compute the closest pair of P in expected linear
time.
Proof: Pick a random permutation of the points of P, and let ⟨p1, . . . , pn⟩ be this permutation. Let r2 =
‖p1 − p2‖, and start inserting the points into the data structure of Lemma 1.2.2. In the ith iteration, if ri = ri−1,
then this insertion takes constant time. If ri < ri−1, then we rebuild the grid and reinsert the points. Namely,
we recompute Gri(Pi).
To analyze the running time of this algorithm, let Xi be the indicator variable which is 1 if ri ≠ ri−1, and
0 otherwise. Clearly, the running time is proportional to
    R = 1 + Σ_{i=2}^{n} (1 + Xi · i).
Thus, the expected running time is E[R] = 1 + Σ_{i=2}^{n} (1 + i · Pr[Xi = 1]), by linearity of expectation and since, for an indicator variable Xi, we have E[Xi] = Pr[Xi = 1].
Thus, we need to bound Pr[Xi = 1] = Pr[ri < ri−1 ]. To bound this quantity, fix the points of Pi , and
randomly permute them. A point q ∈ Pi is called critical, if CP(Pi \ {q}) > CP(Pi ). If there are no critical
points, then ri−1 = ri and then Pr[Xi = 1] = 0. If there is one critical point, then Pr[Xi = 1] = 1/i, as this is
the probability that this critical point, would be the last point in the random permutation of Pi .
If there are two critical points, then let p, q be this (unique) pair of points of Pi realizing CP(Pi). The
quantity ri is smaller than ri−1 only if either p or q is pi. But the probability for that is 2/i (i.e., the
probability, in a random permutation of i objects, that one of two marked objects would be the last element
in the permutation).
Observe that there can not be more than two critical points. Indeed, if p and q are two points realizing
the closest distance, and there is a third critical point r, then CP(Pi \ {r}) = ‖pq‖, and so r is not critical,
a contradiction.
We conclude that
    E[R] = n + Σ_{i=2}^{n} i · Pr[Xi = 1] ≤ n + Σ_{i=2}^{n} i · (2/i) ≤ 3n,
and the expected running time is O(E[R]) = O(n).
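For concreteness, here is a short Python sketch of the algorithm in the proof above; the function names are ours, and the points are assumed to be distinct (so that the closest-pair distance is positive), with n ≥ 2.

    import math
    import random
    from collections import defaultdict

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def cell(p, r):
        # id of the grid cell of width r that contains p
        return (math.floor(p[0] / r), math.floor(p[1] / r))

    def closest_pair(points):
        pts = list(points)
        random.shuffle(pts)                       # the random permutation <p_1, ..., p_n>
        r = dist(pts[0], pts[1])                  # r_2 = ||p_1 - p_2||
        grid = defaultdict(list)
        for p in pts[:2]:
            grid[cell(p, r)].append(p)
        for i in range(2, len(pts)):
            p = pts[i]
            ci, cj = cell(p, r)
            # distance from p to the points inserted so far; scanning the 3x3 cluster
            # around p's cell suffices, since all other points are at distance >= r
            nearest = min((dist(p, q) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                           for q in grid.get((ci + di, cj + dj), [])), default=r)
            if nearest < r:
                # r_i < r_{i-1}: rebuild the grid with the new width and reinsert P_i
                r = nearest
                grid = defaultdict(list)
                for q in pts[:i]:
                    grid[cell(q, r)].append(q)
            grid[cell(p, r)].append(p)
        return r

Each rebuild costs time linear in the number of points inserted so far, which is exactly the Xi · i term in the analysis above.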
Theorem 1.2.3 is a surprising result, since it implies that uniqueness (i.e., deciding if n real numbers are
all distinct) can be solved in linear time. However, there is a lower bound of Ω(n log n) on uniqueness, using
the comparison model. This reality dysfunction can be easily explained once one realizes that the computation model of Theorem 1.2.3 is considerably stronger, as it uses hashing, randomization, and the floor function.
Indeed, compute the median in the x-order of the points of P, split P into two sets, and recurse on each set, till the number of
points in a subproblem is of size ≤ k/4. We have T (n) = O(n) + 2T (n/2), and the recursion stops for n ≤ k/4. Thus, the recursion
tree has depth O(log(n/k)), which implies running time O(n log(n/k)).
Consider the (non-uniform) grid G induced by the lines h1, . . . , hm and v1, . . . , vm. Let X be the set of all
intersection points of G. We claim that Dopt(P, k) contains at least one point of X. Indeed, consider the
center u of Dopt(P, k), and let c be the cell of G that contains u. Clearly, if Dopt(P, k) does not cover any of
the four vertices of c, then it can cover only points in the vertical strip of G that contains c, and only points
in the horizontal strip of G that contains c. See Figure 1.1. However, each such strip contains at most k/4
points. It follows that Dopt(P, k) contains at most k/2 points of P, a contradiction. Thus, Dopt(P, k) must
contain a point of X. For every point p ∈ X, compute the smallest circle centered at p that contains k points
of P. Clearly, for a point q ∈ X ∩ Dopt(P, k), this yields the required 2-approximation. Indeed, the disk of
radius 2·ropt(P, k) centered at q contains at least k points of P since it also covers Dopt(P, k). We summarize
as follows:

(Figure 1.1: If the disk Dopt(P, k) does not contain any vertex of the cell c, then it does not cover any
shaded area. As such, it can contain at most k/2 points, since the vertical and horizontal strips containing c
each has at most k/4 points of P inside it.)

Lemma 1.3.1 Given a set P of n points in the plane, and a parameter k, one can compute, in O(n(n/k)²)
deterministic time, a circle D that contains k points of P, and radius(D) ≤ 2·ropt(P, k).

Corollary 1.3.2 Given a set P of n points and a parameter k = Ω(n), one can compute, in linear time, a
circle D that contains k points of P and radius(D) ≤ 2·ropt(P, k).
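The following Python sketch is one way to implement this 2-approximation (the ApproxHeavy routine used later in the chapter). The rule used here to pick the lines, namely every ⌊k/4⌋-th coordinate in sorted order, is our assumption standing in for the construction of the lines h1, . . . , hm and v1, . . . , vm, which is not reproduced above; the analysis only needs every strip to contain at most k/4 points.

    import heapq
    import math

    def approx_heavy(P, k):
        # Returns a radius r with r_opt(P, k) <= r <= 2 r_opt(P, k), assuming k <= |P|.
        xs = sorted(p[0] for p in P)
        ys = sorted(p[1] for p in P)
        step = max(1, k // 4)
        vlines = xs[step::step] or [xs[len(xs) // 2]]   # x-coordinates of the vertical lines
        hlines = ys[step::step] or [ys[len(ys) // 2]]   # y-coordinates of the horizontal lines
        best = math.inf
        for x in vlines:
            for y in hlines:
                # radius of the smallest circle centered at the grid vertex (x, y)
                # containing k points of P: the k-th smallest distance to (x, y)
                r = heapq.nsmallest(k, (math.hypot(p[0] - x, p[1] - y) for p in P))[-1]
                best = min(best, r)
        return best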
Remark 1.4.1 For a point set P of n points, the radius r returned by the algorithm of Lemma 1.3.1 is the
distance between a vertex of the non-uniform grid and a point of P. As such, a grid Gr computed using this
distance is one of O(n3 ) possible grids. Indeed, a circle is defined by the distance between a vertex of the
non-uniform grid of Lemma 1.3.1, and a point of P. A vertex of such a grid is determined by two points of
P, through which its vertical and horizontal lines pass. Thus, there are O(n³) such triples.
Let gdr (P) denote the maximum number of points of P mapped to a single point by the mapping Gr .
Define depth(P, r) to be the maximum number of points of P that a circle of radius r can contain.
Lemma 1.4.2 For any point set P, and r > 0, we have: (i) for any real number A > 0, it holds that
depth(P, Ar) ≤ (A + 1)² · depth(P, r); (ii) gdr(P) ≤ depth(P, r) ≤ 9 · gdr(P); (iii) if r ≤ 2·ropt(P, k), then
gdr(P) ≤ 5k; and (iv) any disk of radius r is contained in some grid cluster of Gr.
Proof: (i) Consider the disk D of radius Ar realizing depth(P, Ar), and let D′ be the disk of radius
(A + 1)r having the same center as D. For every point p ∈ V = P ∩ D, place a disk of radius r around it,
and let S denote the resulting set of disks. Since every disk of S has area πr², and they are all contained
in D′, which is of area π(A + 1)²r², it follows that there must be a point q inside D′ which is contained in at least
    µ = ⌈ |V|·πr² / (π(A + 1)²r²) ⌉ = ⌈ depth(P, Ar) / (A + 1)² ⌉
disks. This means that the disk of radius r centered at q contains at least µ points of P. Now, µ ≤ depth(P, r).
Thus, depth(P, Ar)/(A + 1)² ≤ µ ≤ depth(P, r), as claimed.
(iv) Consider a (closed) disk D of radius r, and let c be its center. If c is in the interior of a grid cell C,
then the claim easily holds, since D can intersect only C or the cells adjacent to C. Namely, D is contained
in the cluster centered at C. The problematic case, is when c is on the grid boundaries. Since grid cells are
closed on one side, and open on the other, it is easy to verify that the claim holds again by careful and easy
case analysis, which we will skip here.
(ii) Consider the grid cell C of Gr(P) that realizes gdr(P), and let c be a point placed in the center of
C. Clearly, a disk D of radius √(r² + r²)/2 = r/√2 centered at c completely covers the cell C. Thus,
gdr(P) ≤ |D ∩ P| ≤ depth(P, r). As for the other inequality, observe that the disk D realizing depth(P, r)
can intersect at most 9 cells of the grid Gr, by (iv). Thus, depth(P, r) = |D ∩ P| ≤ 9 · gdr(P).
(iii) Let C be the grid cell of Gr realizing gdr (P). Place 4 points, at the corners of C, and one point in the
center of C. Placing a disk of radius ropt (P, k) at each of those points, completely covers C, as can be easily
verified (since the side length of C is at most 2ropt (P, k)). Thus, |P ∩ C| = gdr (P) ≤ 5 depth(P, ropt (P, k)) =
5k.
1.4.1.1 Description
As in the previous sections, we construct a grid which partitions the points into small (O(k) sized) groups.
The key idea behind speeding up the grid computation is to construct the appropriate grid over several
rounds. Specifically, we start with a small set of points as seed and construct a suitable grid for this subset.
Next, we incrementally insert the remaining points, while adjusting the grid width appropriately at each
step.
Let P = (P1 , . . . , Pm ) be a u-gradation of P (see Definition 1.4.3), where u = max(k, n/ log n). The
sequence P can be computed in expected linear time as shown in Lemma 1.4.4.
Since |P1 | = O(k), we can compute r1 , in O(|P1 | (|P1 | /k)2 ) = O(k) = O(n) time, using Lemma 1.3.1.
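Computing a gradation is straightforward; the following Python sketch assumes (since Definition 1.4.3 and Lemma 1.4.4 are not reproduced here) that a u-gradation keeps every point of Pi in Pi−1 independently with probability 1/2, and stops once at most u points remain.

    import random

    def gradation(P, u):
        # returns (P_1, ..., P_m) with P_m = P and |P_1| <= u
        levels = [list(P)]
        while len(levels[-1]) > u:
            levels.append([p for p in levels[-1] if random.random() < 0.5])
        levels.reverse()
        return levels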
Grow(Pi, ri−1, k)
Output: ri
begin
    Gi−1 ← Gri−1(Pi)
    for every grid cluster c ∈ Gi−1 with |c ∩ Pi| ≥ k do
        Pc ← c ∩ Pi
        rc ← ApproxHeavy(Pc, k)
        // ApproxHeavy is the algorithm of Lemma 1.3.1;
        // we have ropt(Pc, k) ≤ rc ≤ 2·ropt(Pc, k).
The algorithm now works in m rounds, where m is the length of the sequence P. At the end of the ith
round, we have a distance ri such that gdri (Pi ) ≤ 5k, and there exists a grid cluster in Gri containing more
than k points of Pi and ropt (Pi , k) ≤ ri .
At the ith round, we first construct a grid Gi−1 for points in Pi using ri−1 as grid width. We know that
there is no grid cell containing more than 5k points of Pi−1 . As such, intuitively, we expect every cell of
Gi−1 to contain at most 10k points of Pi , since Pi−1 ⊆ Pi was formed by choosing each point of Pi into Pi−1
with probability 1/2. (This is of course too good to be true, but something slightly weaker does hold.) This
allows us to use the slow algorithm of Lemma 1.3.1 on those grid clusters. Note that, for k = Ω(n), the
algorithm of Lemma 1.3.1 runs in linear time, and thus the overall running time is linear.
The algorithm used in the ith round is more concisely stated in Figure 1.2. At the end of the m rounds we
have rm , which is a 2-approximation to the radius of the optimal k enclosing circle of Pm = P. The overall
algorithm is summarized in Figure 1.3.
1.4.1.2 Analysis
Lemma 1.4.5 For i = 1, . . . , m, we have ropt (Pi , k) ≤ ri ≤ 2ropt (Pi , k), and the heaviest cell in Gri (Pi )
contains at most 5k points of Pi .
Proof: Consider the optimal circle Di that realizes ropt (Pi , k). Observe that there is a cluster c of Gri−1 that
contains Di , as ri−1 ≥ ri . Thus, when Grow handles the cluster c, we have Di ∩ Pi ⊆ c. The first part of the
lemma then follows from the correctness of the algorithm of Lemma 1.3.1.
As for the second part, observe that any grid cell of width ri can be covered by 5 disks of radius ri/2,
and ri/2 ≤ ropt(Pi, k). It follows that each grid cell of Gri(Pi) contains at most 5k points.
Now we proceed to upper-bound the number of cells of Gri−1 that contains “too many” points of Pi .
Since each point of Pi−1 was chosen from Pi with probability 1/2, we can express this bound as a sum of
independent random variables, and bound this using tail-bounds.
Definition 1.4.6 For a point set P, and parameters k and r, the excess of Gr (P) is
    E(P, r) = Σ_{c ∈ Cells(Gr)} ⌊ |c ∩ P| / (50k) ⌋.
LinearApprox(P, k)
Output: r - a 2-approximation to ropt (P, k)
begin
Compute a gradation {P1 , . . . , Pm } of P as in Lemma 1.4.4
r1 ← ApproxHeavy(P1 , k)
// ApproxHeavy is the algorithm of Lemma 1.3.1
// which outputs a 2-approximation
for i ← 2 to m do
    ri ← Grow(Pi, ri−1, k)
return rm
Remark 1.4.7 The quantity 100k · E(P, r) is an upper bound on the number of points of P in a heavy cell
of Gr (P), where a cell of Gr (P) is heavy if it contains at least 50k points.
Lemma 1.4.8 For any positive real t, the probability that Gri−1(Pi) has excess E(Pi, ri−1) ≥ α = t + ⌈3 lg n⌉
is at most 2^{−t}.
Proof: Let G be the set of O(n3 ) possible grids that might be considered by the algorithm (see Remark 1.4.1),
and fix a grid G ∈ G with excess M = E(Pi , κ(G)) ≥ α, where κ(G) is the sidelength of a cell of G.
Let U = { Pi ∩ c | c ∈ G, |Pi ∩ c| ≥ 50k } be the point sets of all the heavy cells of G(Pi). Furthermore, let
    V = ∪_{X∈U} P(X, 50k),
where P(X, ν) denotes an arbitrary partition of the set X into disjoint subsets such that each one of them
contains ν points, except maybe the last subset, which might contain between ν and 2ν − 1 points.
It is clear that |V| = E(Pi, κ(G)) = M. From the Chernoff inequality, for any S ∈ V, we have µ = E[|S ∩ Pi−1|] ≥
25k, and setting δ = 4/5 we have
    Pr[|S ∩ Pi−1| ≤ 5k] ≤ Pr[|S ∩ Pi−1| ≤ (1 − δ)µ] < exp(−µδ²/2) ≤ exp(−25k·(4/5)²/2) < 1/2.
Furthermore, if G = Gri−1, then each cell of G contains at most 5k points of Pi−1. Thus we have
    Pr[Gri−1 = G] ≤ Π_{S∈V} Pr[|S ∩ Pi−1| ≤ 5k] ≤ (1/2)^{|V|} = 1/2^M ≤ 1/2^α.
Since there are n³ different grids in G, we have
    Pr[E(Pi, ri−1) ≥ α] = Pr[ ∪_{G∈G: E(Pi,κ(G))≥α} (G = Gri−1) ]
                        ≤ Σ_{G∈G: E(Pi,κ(G))≥α} Pr[G = Gri−1] ≤ n³ · (1/2^α) ≤ 1/2^t.
We next bound the expected running time of the algorithm LinearApprox by bounding the expected
time spent in the ith iteration. In particular, let Y be the random variable which is the excess of Gri−1(Pi). In
this case, there are at most Y cells which are heavy in Gri−1(Pi), and each such cell contains at most O(Yk)
points. Thus, invoking the algorithm ApproxHeavy on such a heavy cell takes O(Yk · ((Yk)/k)²) = O(Y³k)
time. Overall, the running time of Grow, in the ith iteration, is T(Y) = O(|Pi| + Y · Y³k) = O(|Pi| + Y⁴k).
For technical reasons, we need to consider the light and heavy cases separately to bound Y. So set
Λ = ⌈3 lg n⌉.
The Light Case: k < Λ. We have that the expected running time is proportional to
    O(|Pi|) + Σ_{t=0}^{⌈n/k⌉} Pr[Y = t] · T(t)
        = O(|Pi|) + Pr[0 ≤ Y ≤ Λ] · T(Λ) + Σ_{t=Λ+1}^{n/k} Pr[Y = t] · T(t)
        ≤ O(|Pi|) + T(Λ) + Σ_{t=1}^{n/k} (1/2^t) · T(t + Λ)
        = O(|Pi|) + O(k log⁴ n) + Σ_{t=1}^{n/k} (t + Λ)⁴·k / 2^t
        = O(|Pi| + k log⁴ n) = O(|Pi|),
by Lemma 1.4.8 and since T (·) is a monotone increasing function.
Lemma 1.4.9 For k ≥ Λ, the probability that Gri−1(Pi) has excess larger than t is at most 2^{−t}.
Proof: We use the same technique as in Lemma 1.4.8. By the Chernoff inequality, the probability that any
subset of Pi of size 50k would contain at most 5k points of Pi−1 is less than
    exp(−25k · (16/25) · (1/2)) ≤ exp(−5k) ≤ 1/n⁴.
In particular, arguing as in Lemma 1.4.8, it follows that the probability that E(Pi, ri−1) exceeds t is smaller
than n³/n^{4t} ≤ 2^{−t}.
Thus, if k ≥ Λ, the expected running time of Grow in the ith iteration is at most
    O( Σ_{c ∈ Gri−1} |c ∩ Pi| · (|c ∩ Pi| / k)² ) = O( |Pi| + Σ_{t=1}^{∞} t · tk · (tk/k)² · (1/2^t) ) = O(|Pi| + k) = O(|Pi|),
by Lemma 1.4.9.
Overall Running Time Analysis. Thus, by the above analysis and by Lemma 1.4.4, the total expected
running time of LinearApprox inside the inner loop is O(Σ_i |Pi|) = O(n). As for the last step, of computing
a 2-approximation, consider the grid Grm (P). Each grid cell contains at most 5k points, and hence each grid
cluster contains at most 45k points. Also the smallest k enclosing circle is contained in some grid cluster. In
each cluster that contains more than k points, we use the algorithm of Corollary 1.3.2, and finally output the
minimum over all the clusters. The overall running time is O((n/k)k) = O(n) for this step, since each point
belongs to at most 9 clusters.
Theorem 1.4.10 Given a set P of n points in the plane, and a parameter k, one can compute, in expected
linear time, a radius r, such that ropt (P, k) ≤ r ≤ 2ropt (P, k).
Once we compute r such that ropt (P, k) ≤ r ≤ 2ropt (P, k), using the algorithm of Theorem 1.4.10, we
apply an exact algorithm to each cluster of the grid Gr (P) which contains more than k points.
Matoušek presented such an exact algorithm [Mat95a], and it has running time of O(n log n + nk) and
space complexity O(nk). Since r is a 2 approximation to ropt (P, k), each cluster has O(k) points. Thus the
running time of the exact algorithm in each cluster is O(k2 ) and requires O(k2 ) space. The number of clusters
which contain more than k points is O(n/k). Hence the overall running time is O(nk), and the space used is
O(n + k2 ).
Theorem 1.4.11 Given a set P of n points in the plane and a parameter k, one can compute, in expected
O(nk) time, using O(n + k2 ) space, the radius ropt (P, k), and a circle Dopt (P, k) that covers k points of P.
1.6 Exercises
Exercise 1.6.1 (Compute clustering radius.) [10 Points]
Let C and P be two given sets of points in the plane, such that k = |C| and n = |P|. Let r =
max_{p∈P} min_{c∈C} ‖c − p‖ be the covering radius of P by C (i.e., if we place a disk of radius r around each
point of C, then all those disks together cover the points of P).
(A) Give an O(n + k log n) expected time algorithm that outputs a number α, such that r ≤ α ≤ 10r.
(B) For ε > 0 a prescribed parameter, give an O(n + kε^{−2} log n) expected time algorithm that outputs a number
α, such that α ≤ r ≤ (1 + ε)α.
Chapter 2
In this chapter, we discuss quadtrees, which are arguably one of the simplest and most powerful geometric
data-structures. We begin in Section 2.1 by giving a simple application of quadtrees, and describe a clever
way for performing point-location queries quickly in such a quadtree. In Section 2.2, we describe how such
quadtrees can be compressed, how they can be quickly constructed, and how they can be used for point-location queries.
In Section 2.3 we describe a randomized extension of this data-structure, known as skip-quadtree, which
enables us to maintain the compressed quadtree efficiently under insertions and deletions. In Section 2.5, we
turn our attention to applications of compressed quadtrees, showing how quadtrees can be used to compute
good triangulations of an input point set.
2.1.1 Fast point-location in a quadtree
One possible interpretation of quadtrees is that they are a multi-grid representation of a point-set. In partic-
ular, given a node v, with a square S v , which is of depth i (the root has depth zero), then the side length of
Sv is 2^{−i}, and it is a square in the grid G_{2^{−i}}. In fact, we will refer to ℓ(v) = −i as the level of v. However,
a cell in a grid has a unique ID made out of two integer numbers. Thus, a node v of a quadtree is uniquely
defined by the triple id(v) = (ℓ(v), ⌊x/r⌋, ⌊y/r⌋), where (x, y) is any point in 2v, and r = 2^{ℓ(v)}.
Furthermore, given a query point q, and a desired level ℓ, we can compute the ID of the quadtree cell of
this level that contains q in constant time. Thus, this suggests a very natural algorithm for doing a
point-location in a quadtree: store all the IDs of the nodes in the quadtree in a hash-table, and also compute the
maximal depth h of the quadtree. Given a query point q, we now have access to any node along the
point-location path of q in T, in constant time. In particular, we want to find the point in T where the
point-location path “falls off” the quadtree. This we can find by performing a binary search for the
dropping-off point. Let QTGetNode(T, q, d) denote the procedure that, in constant time, returns the node v of
depth d in the quadtree T such that 2v contains the point q. Given a query point q, we can perform
point-location in T by calling QTFastPntLocInner(T, q, 0, height(T)). See Figure 2.1 for the pseudo-code of
QTFastPntLocInner.

QTFastPntLocInner(T, q, lo, hi)
    mid ← ⌊(lo + hi)/2⌋
    v ← QTGetNode(T, q, mid)
    if v = null then
        return QTFastPntLocInner(T, q, lo, mid − 1)
    w ← Child(v, q)    // w is the child of v containing the point q
    if w = null then
        return v
    return QTFastPntLocInner(T, q, mid + 1, hi)

Figure 2.1: One can perform point-location in a quadtree T by calling QTFastPntLocInner(T, q, 0, height(T)).
Lemma 2.1.1 Given a quadtree T of size n and of height h, one can preprocess it in linear time, such that
one can perform a point-location query in T in O(log h) time. In particular, if the quadtree has height
O(log n) (i.e., it is “balanced”), then one can perform a point-location query in T in O(log log n) time.
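A concrete Python version of this binary search follows; the hash-table of node IDs is simply a Python set here, and the function names are ours.

    import math

    def cell_id(q, level):
        # id of the canonical grid cell of side 2**level that contains q, for q in [0,1)^2
        r = 2.0 ** level
        return (level, math.floor(q[0] / r), math.floor(q[1] / r))

    def point_location(node_ids, q, height):
        # node_ids: set of ids of all quadtree nodes; height: maximal depth of the quadtree.
        # Returns the id of the lowest quadtree node whose cell contains q.
        lo, hi = 0, height
        best = cell_id(q, 0)                 # the root always contains q
        while lo <= hi:
            mid = (lo + hi) // 2
            cid = cell_id(q, -mid)           # depth mid corresponds to level -mid
            if cid in node_ids:
                best = cid
                lo = mid + 1                 # the point-location path continues below depth mid
            else:
                hi = mid - 1                 # we have already fallen off the quadtree
        return best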
    Φ(P) = ( max_{p,q∈P} ‖p − q‖ ) / ( min_{p,q∈P, p≠q} ‖p − q‖ )
be the spread of P. In words, the spread of P is the ratio between the diameter of P and the distance between
the two closest points. Intuitively, the spread tells us the range of distances that P possesses.
One can build a quadtree T for P, storing the points of P in the leaves of T, where one keeps splitting a
node as long as it contains more than one point of P. During this recursive construction, if a leaf contains
no points of P, we save space by not creating this leaf, and instead creating a null pointer in the parent node
for this child.
Lemma 2.2.2 Let P be a set of n points in the unit square, such that diam(P) = max_{p,q∈P} ‖p − q‖ ≥ 1/2. Let
T be a quadtree of P constructed over the unit square. Then, the depth of T is bounded by O(log Φ(P)), it
can be constructed in O(n log Φ(P)) time, and the total size of T is O(n log Φ(P)).
Proof: The construction is done by a straightforward recursive algorithm as described above.
Let us bound the depth of T. Consider any two points p, q ∈ P, and observe that a node v of T of level
u = ⌊lg ‖p − q‖⌋ − 1 containing p can not contain q (we remind the reader that lg n = log₂ n). Indeed, the
diameter of 2v is smaller than √2 · 2^u ≤ √2 · ‖p − q‖/2 < ‖p − q‖. Thus, 2v can not contain both p and q. In
particular, any node of T of level − lg Φ − 1 or smaller can contain at most one point of P, where Φ = Φ(P).
Thus, all the nodes of T are of depth O(log Φ).
Since the construction algorithm spends O(n) time at each level of T, it follows that the construction
time is O(n log Φ), and this also bounds the size of the quadtree T.
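The recursive construction just described can be sketched in a few lines of Python; the points are assumed to be distinct and to lie in the unit square, and representing nodes as plain dictionaries is our implementation choice, not something mandated by the text.

    def build_quadtree(points, cx=0.5, cy=0.5, half=0.5):
        # node square = [cx - half, cx + half] x [cy - half, cy + half]
        if not points:
            return None                      # empty children are not created
        node = {"center": (cx, cy), "half": half, "points": points, "children": []}
        if len(points) == 1:
            return node                      # a leaf stores a single point of P
        h = half / 2
        for dx in (-h, h):
            for dy in (-h, h):
                sub = [p for p in points
                       if (p[0] >= cx) == (dx > 0) and (p[1] >= cy) == (dy > 0)]
                child = build_quadtree(sub, cx + dx, cy + dy, h)
                if child is not None:
                    node["children"].append(child)
        return node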
The bounds of Lemma 2.2.2 are tight, as one can easily verify; see Exercise 2.7.2. But in fact, if you
inspect a quadtree generated by Lemma 2.2.2, you would realize that there are a lot of nodes of T which are
of degree one (the degree of a node is the number of children it has). Indeed, for a node v of T, let Pv be the
subset of points of P stored in the subtree of v. A node v has degree larger than one only if it splits Pv into
at least two subsets, and globally there can be only n − 1 such splitting nodes.
Thus, a quadtree T contains a lot of “useless” nodes. We can replace such
a sequence of edges by a single edge. To this end, we will store inside each
quadtree node v, its square 2v , and its level `(v). Given a path of vertices in the
quadtree that are all of degree one, we will replace them with a single vertex
that corresponds to the first vertex in this path, and its only child would be the
last vertex in this path (this is the first node of degree larger than one). This
compressed node has a single child, and the region rgv that it is in “charge” of
is an annulus, see the figure on the right. Otherwise, the region that a node is
in charge of is rgv = 2v. The child corresponds to the inner square. We call
the resulting tree a compressed quadtree. Since any node that has only a single
child is compressed, we can charge it to its parent, which has two children. Since there are at most n − 1
internal nodes in the new compressed quadtree that have degree larger than one, it follows that it has linear
size (however, it still can have linear depth in the worst case).
As an application of such a compressed quadtree, consider the problem of reporting the points of P
that are inside a query rectangle r. We can start from the root of the quadtree, and recursively traverse it, going
down into a node only if its region intersects the query rectangle. Clearly, we will report all the points contained
inside r. Of course, we have no guarantee about the query time, but in practice, this might be fast enough.
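A sketch of this traversal, over the dictionary-based nodes of the construction sketch above, looks as follows; for a compressed quadtree one would test the node's region (square or annulus) instead of its square.

    def report_in_rect(node, rect, out):
        # rect = (x0, y0, x1, y1); appends to out every stored point inside rect
        if node is None:
            return
        (cx, cy), h = node["center"], node["half"]
        # prune: recurse only if the node's square intersects the query rectangle
        if cx + h < rect[0] or rect[2] < cx - h or cy + h < rect[1] or rect[3] < cy - h:
            return
        if not node["children"]:                       # a leaf: test its point(s) directly
            for p in node["points"]:
                if rect[0] <= p[0] <= rect[2] and rect[1] <= p[1] <= rect[3]:
                    out.append(p)
            return
        for child in node["children"]:
            report_in_rect(child, rect, out)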
in the appropriate node of Tout . This takes linear time (ignoring the time to construct Tin and Tout ). Thus,
the overall construction time is O(n log n).
Theorem 2.2.3 Given a set P of n points in the plane, one can compute a compressed quadtree of P in
O(n log n) deterministic time.
Definition 2.2.4 (Canonical square and grid.) A square is a canonical square, if it is contained inside the
unit square, it is a cell in a grid Gr , and r is a power of two (i.e., it might correspond to a node in a quadtree).
We will refer to such a grid Gr , as a canonical grid.
For reasons that would become clear later, we want to construct the quadtree out of a list of quadtree
nodes that must appear in the quadtree. Namely, we get a list of canonical grid cells that must appear in the
quadtree (i.e., the level of the node, together with its grid ID).
Lemma 2.2.5 Given a list C of n canonical squares, all lying inside the unit square, one can construct a
compressed quadtree T such that for any square c ∈ C, there exists a node v ∈ T , such that 2v = c. The
construction time is O(n log n).
Proof: The construction is similar to Theorem 2.2.3. Let P be a set of n points, where pc ∈ P, if c ∈ C,
and pc is the center of c. Next, find, in linear time, a canonical square c that contains at least n/250 points
of P, and at most n/2 points of P. Let U be the list of all squares of C that contain c, let Cin be the list
of squares of C contained inside c, and let Cout be the list of squares of C that do not intersect the interior of c.
Recursively, build compressed quadtrees for Cin and Cout, denoted by Tin and Tout, respectively.
Next, sort the nodes of U in decreasing order of their level. Also, let π be the point-location path of c in
Tout . Clearly, adding all the nodes of U to Tout is no more than performing a merge of π together with the
sorted nodes of U. Whenever we encounter a square of U that does not have a corresponding node at π, we
create this node, and insert it into π. Let T′out denote the resulting tree. Next, we just hang Tin in the right
place in T′out. Clearly, the resulting quadtree has all the squares of C as nodes.
As for the running time, we have T (C) = T (Cin ) + T (Cout ) + O(n) + O(|U| log |U|) = O(n log n), since
|Cout | + |Cin | + |U| = n and |Cin | , |Cout | ≤ (249/250)n.
Definition 2.2.6 Let T be a tree with n nodes. A separator in T is a node v, such that if we remove v from
T, we remain with a forest, such that every tree in the forest has at most dn/2e vertices.
Lemma 2.2.7 Every tree has a separator, and it can be computed in linear time.
Proof: Consider T to be a rooted tree, and initialize v to be the root of T. We perform a walk on T. If v
is not a separator, then one of the children of v in T must have a subtree of T of size ≥ dn/2e nodes. Set v to
be this node. Continue in this walk, till we get stuck. The claim is that v is the required node. Indeed, since
we always go down, and the size of the subtree shrinks, we must get stuck. Thus, consider w as the node
we got stuck at. Clearly, the subtree of w contains at least dn/2e nodes (otherwise, we would not set v = w).
Also, all the subtrees of w have size ≤ dn/2e, and the connected component of T \ {w} containing the root
contains at most n − dn/2e ≤ bn/2c nodes. Thus, w is the required separator.
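The walk described in the proof translates directly into code. The following Python sketch uses dictionary-based tree nodes with a "children" list (an illustrative representation); it caches subtree sizes and then walks down as long as some child subtree is too large.

    import math

    def find_separator(root):
        def subtree_size(v):                           # compute and cache subtree sizes
            v["_size"] = 1 + sum(subtree_size(c) for c in v["children"])
            return v["_size"]
        n = subtree_size(root)
        half = math.ceil(n / 2)
        v = root
        while True:
            big = [c for c in v["children"] if c["_size"] > half]
            if not big:
                # every child subtree, and the rest of the tree, has at most ceil(n/2) nodes
                return v
            v = big[0]                                 # at most one child can be too large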
This suggests a natural way for preprocessing a compressed quadtree for point-location queries. Find a
separator v ∈ T, and create a root node fv for T′ which has a pointer to v; now recursively build finger trees
for the trees of T \ {v}, and hang them on fv. Given a query point q, we traverse T′, where at a node fv ∈ T′, we
check whether the query point q ∈ 2v, where v is the corresponding node of T. If q ∉ 2v, we continue the
search into the child of fv which corresponds to the connected component outside 2v that was hung on fv.
Otherwise, we continue into the child that contains q. This takes constant time per node. As for the depth
of the finger tree T′, observe that D(n) ≤ 1 + D(⌈n/2⌉) = O(log n). Thus, a point-location query in T′ takes
logarithmic time.
Theorem 2.2.8 Given a compressed quadtree T of size n, one can preprocess it in O(n log n) time, such that
given a query point q, one can return the lowest node in T whose region contains q in O(log n) time.
Let T1, . . . , Tm be the quadtrees of the sets P = S1, . . . , Sm, respectively. Note that the nodes of Ti are
a subset of the nodes that appear in Ti−1. As such, every node in Ti has a pointer to its own copy in Ti−1,
and a pointer to its copy in Ti+1 if it exists there. We will refer to this data-structure as a skip-quadtree.
Point-location queries. Given a query point q, we want to find the leaf of T1 that contains it. The search
algorithm is quite simple: starting at Tm, we find the leaf of Ti that contains the query point, then move
to the corresponding node in Ti−1, and continue the search from there.
2.3.2 Running time analysis
We analyze the time needed to perform a point-location query. In particular, we claim that the expected
query time is O(log n). To see this, we will use the standard backward analysis. Consider the leaf v of Ti that
contains q, and consider the path v = v1 , v2 , . . . , vr from v to the root vr of the compressed quadtree Ti . Let
π denote this path. Clearly, the amount of time spent in the search in the tree Ti , is proportional to how far
we have to go on this list, till we hit a node that appears in Ti+1 . (During the point-location we traverse this
snippet of the path in the other direction - we are in a leaf of Ti+1 , we jump into the corresponding node
of Ti , and traverse down till we reach a leaf.) Note, that the node v j stores (at least) j points of S i , and if
any pair of them appears in S i+1 then at least one of the nodes on π below the node v j would appear in Ti+1
(since the quadtree Ti+1 needs to “separate” these two points and this is done by a node of the path). Let
Uj denote this set of points, and let Xj be an indicator variable which is one if vj does not appear in Ti+1.
Clearly,
    E[Xj] = Pr[ no pair of points of Uj is in Si+1 ] ≤ (j + 1)/2^j,
since the points of Si are randomly and independently chosen to be in Si+1, and the event happens only if
zero or one points of Uj are in Si+1. Thus, the expected search time in Ti is
    E[ Σ_j Xj ] = Σ_j E[Xj] ≤ Σ_j (j + 1)/2^j = O(1).
Thus, the overall expected search time in the skip-quadtree is proportional to the number of levels in the
gradation.
Let Zi = |Si| be the number of elements stored in the ith level of the gradation. We know that Z1 = n,
and E[Zi | Zi−1] = Zi−1/2. In particular, E[Zi] = E[E[Zi | Zi−1]] = E[Zi−1/2] = · · · = n/2^{i−1}. Thus, E[Zα] ≤ 1/n^{10},
where α = ⌈11 lg n⌉. Thus, by Markov's inequality, we have that
    Pr[m > α] = Pr[Zα ≥ 1] ≤ E[Zα] / 1 ≤ 1/n^{10}.
We summarize:
Lemma 2.3.2 A gradation defined over n elements has O(log n) levels both in expectation and with high
probability.
This implies that a point-location query in the skip quadtree takes, in expectation, O(log n) time.
Since, with high probability, there are only O(log n) levels in the gradation, it follows that the expected
search time is O(log n).
High Probability. In fact, one can show that this point-location query time bound holds with high proba-
bility, and we sketch (informally) the argument why this is true. Consider the variable Yi that is the number
of nodes of Ti visited during the point-location query. We have that Pr[Yi ≥ k] ≤ Σ_{j=k}^{∞} (j + 1)/2^j =
O(k/2^k) = O(1/c^k), for some constant c > 1. Thus, we have a sum of a logarithmic number of independent
random variables, each one of them behaving like a variable with geometric distribution. As such, we can
apply a Chernoff-type inequality (see Exercise 25.4.3) to get an upper bound on the probability that the sum
of these variables exceeds O(log n). This probability is bounded by 1/n^{O(1)}.
Note, the longest query time is realized by one of the points stored in the quadtree. Since there are n
points stored in the quadtree, this implies that with high probability all point-location queries takes O(log n)
time. Also, observe that the structure of the skip-quadtree is uniquely determined by the gradation. Since
the gradation is oblivious to the history of the data-structure (i.e., which points were inserted and deleted),
these bounds on the performance hold at any point in time during the usage of the skip-quadtree.
We summarize:
Theorem 2.3.3 Let T be an empty skip-quadtree used for a sequence of n operations (i.e., insertions, dele-
tions and point-location queries). Then, with high probability (and thus also in expectation), the time to
perform each such operation takes O(log n) time.
So consider a quadtree T, and a DFS traversal of T, where the DFS always traverses the children of a
node in the same relative order (i.e., say, first the bottom-left child, then the bottom-right child, the top-left
child, and the top-right child).
Consider any two canonical squares 2 and 2̂, and imagine a quadtree T that contains both squares (i.e.,
there are nodes in T with these squares as their cells). Notice that the above DFS would always visit these
two nodes in a specific order, independent of the structure of the rest of the quadtree. Thus, if 2 gets visited
before 2̂, we denote this fact by 2 ≺ 2̂. This defines a total ordering over all canonical squares. It would in
fact be useful to extend this ordering to also include points. Thus, consider a point p and a canonical square
2. If p ∈ 2 then we will say that 2 ≺ p. Otherwise, if 2 ∈ Gi, let 2̂ be the cell in Gi that contains p. We
have that 2 ≺ p if and only if 2 ≺ 2̂. Next, consider two points p and q, and let Gi be a grid fine enough
such that p and q lie in two different cells, say, 2p and 2q, respectively. Then p ≺ q if and only if 2p ≺ 2q.
We will refer to the ordering induced by ≺ as the Q-order.
The ordering ≺, when restricted only to points, is the ordering along a space filling mapping that is
induced by the quadtree DFS. This ordering is known as the Z-order. Note, however, that since we allow
comparing cells to cells, and cells to points, the Q-order no longer has this exact interpretation. Furthermore,
unlike the Peano or Hilbert curve, our mapping is not continuous. Our mapping has the advantage of being
easy to define. Indeed, given a real number α ∈ [0, 1), with the binary expansion α = 0.x₁x₂x₃ . . . (i.e.,
α = Σ_{i=1}^{∞} xi 2^{−i}), our mapping will map it to the point (0.x₂x₄x₆ . . . , 0.x₁x₃x₅ . . .).
    ℓ = min( bit∆(xp, xq), bit∆(yp, yq) ) − 1,
where xp and yp denote the x and y coordinates of p, respectively. Thus, the side length of 2 = LCA(p, q)
is ∆ = 2^{−ℓ}. Let x′ = ⌊x/∆⌋ and y′ = ⌊y/∆⌋. Thus,
The LCA of two cells is just the LCA of their centers.
Now, given two cells 2 and 2̂, we would like to determine their Q-order. If 2 ⊆ 2̂ then 2̂ ≺ 2. If
2̂ ⊆ 2 then 2 ≺ 2̂. Otherwise, let 2̃ = LCA(2, 2̂). We can now determine which children of 2̃ contain
these two cells, and since we know the traversal ordering among the children of a node in a quadtree, we can
now resolve this query in constant time.
Corollary 2.4.1 Assuming that the bit∆ operation and the ⌊·⌋ operation can be performed in constant time,
one can compute the LCA of two points (or cells) in constant time. Similarly, the Q-order can be resolved
in constant time.
Computing bit∆ efficiently. It seems somewhat suspicious that one assumes that the bit∆ operations can
be done in constant time on a classical RAM machine. However it is a reasonable assumption on a real
world computer. Indeed, in floating point representation, once you are given a number it is easy to access its
mantissa and exponent in constant time. If the exponents are different then bit∆ can be computed in constant
time. Otherwise, we can easily xor the mantissas of both numbers, and compute the most significant bit
that is on. This can be done in constant time by converting the xored mantissa into a floating point number
and computing its log₂ (some CPUs have this operation built in). Observe that all these operations are
implemented in hardware in the CPU and require only constant time.
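For points with nonnegative integer coordinates (e.g., after snapping points of [0, 1)² to a fine integer grid), the Z-order comparison can also be written without any floating-point tricks, by locating the coordinate that holds the most significant differing bit. This is a standard alternative to the procedure described above, not the text's own method, and the convention of which coordinate is the more significant one may differ from the mapping defined earlier.

    def less_msb(a, b):
        # True if the most significant set bit of a is strictly lower than that of b
        return a < b and a < (a ^ b)

    def z_less(p, q):
        # True if the point p precedes the point q in Z-order (integer coordinates)
        msd = 0                              # coordinate of the most significant difference
        for d in range(1, len(p)):
            if less_msb(p[msd] ^ q[msd], p[d] ^ q[d]):
                msd = d
        return p[msd] < q[msd]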
Lemma 2.4.2 Given a quadtree T of size n, with its leaves stored in an ordered-set data-structure D ac-
cording to the Q-order, then one can perform point-location query in O(Q(n)) time, where Q(n) is the time
to perform a search query in D.
Lemma 2.4.3 Given two quadtrees T 0 and T 00 given as sorted lists of their nodes, one can compute the
merged quadtree in linear time (in their size) by merging the two sorted lists and removing duplicates.
If the binary search using the Q-order returns a node v such that 2v ≺ q, then if q ∈ 2v we are done,
as v is the required answer. So, if q ∉ 2v, then it must be that the node u such that q ∈ rgu is a compressed
node. As such, consider the cell 2 = LCA(2v, q). Clearly, the compressed node w whose region contains
q has the property that 2 ⊆ 2w. Furthermore, let z be the only child of w. We have that 2z ⊆ 2 ⊆ 2w.
In particular, in the ordering of the nodes by the Q-order we have 2w ≺ 2 ≺ 2z, where 2w and 2z are
consecutive in the ordering of the nodes of the compressed quadtree. It follows that we can find 2w by
doing an additional binary search for 2 in the ordered set of the nodes of the compressed quadtree. We
summarize:
Lemma 2.4.4 Given a compressed quadtree T of size n, with its leaves stored in an ordered-set data-
structure D according to the Q-order, then one can perform point-location query in T in O(Q(n)) time,
where Q(n) is the time to perform a search query in D.
• The node w is a leaf, and there is no point associated with it. Then we just store the new point q at w, and we are
done.
• The node w is a leaf, and there is a point p already stored in w. In this case, let 2 = LCA(p, q),
and insert 2 into the compressed quadtree. Furthermore, split 2 into its children, and also insert the
children into the compressed quadtree. Finally, associate p with the new leaf that contains it, and
associate q with the leaf that contains it. Note, that because of the insertion w becomes a compressed
node if 2w ≠ 2, and it becomes a regular internal node otherwise.
• The node w is a compressed node. Let z be the child of w, and consider 2 = LCA(2z, q). Insert 2
into the compressed quadtree if 2 ≠ 2w (note that in this case w would still be a compressed node, but
with a larger “hole”). Also insert all the children of 2 into the quadtree, and store q in the appropriate
child. Hang 2z from the appropriate child, and turn this child into a compressed node.
In all three cases, the insertion requires a constant number of search/insert operations on the ordered-set
data-structure.
Deletion is done in a similar fashion. We delete the point from the node that contains it, and then we
trim away nodes that are no longer necessary.
Theorem 2.4.5 Assuming one can compute the Q-order in constant time, one can maintain a compressed
quadtree of a point set in O(log n) time per operation, where insertion, deletion and point-location
queries are supported. Furthermore, this can be implemented using an ordered-set data-structure.
Lemma 2.5.1 Let φ be the smallest angle for a triangle. We have that 1/ sin φ ≤ Aratio (4) ≤ 2/ sin φ.
Proof: Consider the triangle 4 = 4abc.
(Figure: the triangle 4abc, with vertices A, B, C, opposite edges of lengths a, b, c, the height h from C onto the edge c, and the smallest angle φ at A.)
We have Aratio (4) = c/h. However, h = b sin φ, and since a is the shortest edge in the triangle (since it is
facing the smallest angle), it must be that b is the middle length edge. As such, 2b ≥ a + b ≥ c. Thus,
Aratio (4) ≥ b/h = b/(b sin φ) = 1/ sin φ. And similarly, Aratio (4) ≤ 2b/h = 2b/(b sin φ) = 2/ sin φ.
Another natural measure of sharpness is the edge ratio Eratio (4), which is the ratio between a triangle’s
longest and shortest edges. Clearly, Aratio (4) > Eratio (4), for any triangle 4. For a triangulation M, we
denote by Aratio (M) the maximum aspect ratio of a triangle in M. Similarly, Eratio (M) denotes the maximum
edge ratio of a triangle in M.
Definition 2.5.2 A corner of a quadtree cell is one of the four vertices of its square. The corners of the
quadtree are the points that are corners of its cells. We say that the side of a cell is split if either of the
neighboring boxes sharing it is split. A quadtree is balanced if any side of an unsplit cell may contain only
one quadtree corner in its interior. Namely, adjacent leaves are either of the same level, or of adjacent levels.
Lemma 2.5.3 Let P be a set of points in the plane, such that diam(P) = Ω(1) and Φ = Φ(P). Then, one can
compute a (minimal size) balanced quadtree T of P in O(n log n + m) time, where m is the size of the
output quadtree.
Proof: Compute a compressed quadtree T of P in O(n log n) time. Next, we traverse T, and replace
every compressed edge of T by the sequence of quadtree nodes that defines it. To guarantee the balance
condition, we create a queue of the nodes of T, and store the nodes of T in a hash table, with their IDs.
We handle the nodes in the queue, one by one. For a node v, we check whether the current adjacent
nodes to 2v are balanced. Specifically, let c be one of 2v ’s neighboring cells in the grid of 2v , and let c p
be the square containing c in a grid one level up. We compute id(c), id(c p ), and check if there is a node in
T with those IDs. If not, we create a node w with region c p and id(c p ), and recursively retrieve its parent
(i.e., if it exists we retrieve it, otherwise, we create it), and hang w from the parent node. We credit the work
involved in creating w to the output size. We add all the new nodes to the queue. We repeat the process till
the queue is empty.
Since the algorithm never creates nodes smaller than the smallest cell in the original compressed quadtree,
it follows that this algorithm terminates. It is also easy to argue by induction that any balanced quadtree of
P must contain all the nodes we created. Overall, the running time of the algorithm is O(n log n + m), since
the work associated with any newly created quadtree node is constant.
Definition 2.5.4 The extended cluster of a cell c in a quadtree T is the set of 5 × 5 neighboring cells of c in
the grid containing c, which are all the cells in distance < 2l from c, where l is the sidelength of c.
A quadtree T over a point set P is well-balanced, if it is balanced, and for every leaf node v that contains
a (single) point of P, we have the property that all the nodes of the extended cluster of v are leaves in T (i.e.,
none of them is split and has children), and they do not contain any other point of P. In fact, we will also
require that for every non-empty node v, all the nodes of the extended cluster of v are nodes in the quadtree.
Lemma 2.5.5 Given a point set P of n points in the plane, one can compute a well-balanced quadtree of P
in O(n log n + m) time, where m is the size of the output quadtree.
Figure 2.2: A well balanced triangulation.
Proof: We compute a balanced quadtree T of P. Next, for every leaf node v of T which contains a point
of P, we verify that all its extended cluster are leaves of T. If any other of the nodes of the extended cluster
of v contains a point of P, we split v. If any of the extended cluster nodes is missing as a leaf, we insert
it into the quadtree (with its ancestors if necessary). We repeat this process till we stop. Of course, during
this process, we keep the balanced property valid, by adding necessary nodes. Clearly, all this work can
be charged to newly created nodes, and as such takes linear time in the output size once the compressed
quadtree is computed.
A well-balanced quadtree T of P provides for every point, a region (i.e., extended cluster) where it is
well protected from other points. It is now possible to turn the partition of the plane induced by the leaves
of T into a triangulation of P.
We “warp” the quadtree framework as follows. For each input point x ∈ P, let y be the corner nearest to x of the leaf of T containing
x; we replace y by x as a corner of the quadtree. Finally, we triangulate the resulting planar subdivision.
Unwarped boxes are triangulated with isosceles right triangles by adding a point in the center. Only boxes
with unsplit sides have warped corners; for these we choose the diagonal that gives better aspect ratio.
Figure 2.2 shows a triangulation resulting from a variant of this method.
Lemma 2.5.6 The method above gives a triangulation QT (P) with Aratio (QT (P)) ≤ 4.
Proof: The right triangles used to triangulate the unwarped cells have aspect ratio 2. If a cell with side
length l is warped, we have two cases.
In the first case, the input point of P is inside the square of the original cell. Then we assume that the
diagonal touching the warped point is chosen; otherwise, the aspect ratio can only be better than what we
prove. Consider one of the two triangles formed, with corners the input point and two other cell corners.
The maximum length hypotenuse is formed when the warped point is at its original location, and has length
h = √2·l. The minimum area is formed when the point is in the center of the square, and has area a = l²/4.
Thus, the minimum height of such a triangle 4 is ≥ 2a/h, and Aratio(4) ≤ h/(2a/h) = h²/(2a) = 4.
In the second case, the input point is outside the original square. Since the quadtree is well balanced, the
new point y is somewhere inside a square of sidelength l centered at x (since we always move the closest leaf
corner to the new point). In this case, we assume that the diagonal not touching the warped point is chosen.
This divides the cell into an isosceles right triangle and another triangle. If the chosen diagonal is the longest
edge of the other triangle, then one can argue as before, and the aspect ratio is bounded by 4. Otherwise, the
longest edge touches the input point. The altitude is minimized when the triangle is isosceles with as sharp
an angle as possible; see Figure 2.3. Using the notation of Figure 2.3, we have y = (l/2, (√7/2)·l). Thus,

    µ = area(4wyz) = (1/2) · | det( 1   0     l
                                    1   l     0
                                    1   l/2   (√7/2)·l ) |  =  ((√7 − 1)/4) · l².

(Figure 2.3: the extremal configuration; the cell has corners (0, 0) and w = (l, 0), y is the warped corner, and h′ is the altitude of the triangle 4wyz at y.)

We have h′ · (√2·l)/2 = µ, and thus h′ = √2·µ/l = ((√7 − 1)/(2√2)) · l. The longest distance y can be from w is
α = √((1/2)² + (3/2)²) · l = (√10/2) · l. Thus, the aspect ratio of the new triangle is bounded by
α/h′ = (√10/2) / ((√7 − 1)/(2√2)) ≈ 2.717 ≤ 4.
For a triangulation M, let |M| denote the number of triangles of M. The Delaunay triangulation of
a point set is the triangulation formed by all triangles defined by the points such that their circumscribing
triangles are empty (the fact that this collection of triangles forms a triangulation requires a proof). Delaunay
triangulations are extremely useful, and have a lot of useful properties. We denote by DT (P) the Delaunay
triangulation of P.
Lemma 2.5.7 There is a constant c₀, independent of P, such that |QT(P)| ≤ c₀ Σ_{△∈DT(P)} log Eratio(△).
Proof: For this lemma, we modify the description of our algorithm for computing QT (P). We compute
the compressed quadtree T 00 of P, and we uncompress the edges by inserting missing cells. Next, we split
a leaf of T 00 if it has side length κ, it is not empty (i.e., it contains a point of P), and there is another point
of P of distance ≤ 2κ from it. We refer to such a node as being crowded. We repeat this, till there are no
crowded leaves. Let T 0 denote the resulting quadtree. We now iterate over all the nodes v of T 0 , and insert
all the nodes of the extended cluster of v into T 0 . Let T denote the resulting quadtree. It is easy to verify that
T is well-balanced, and identical to the quadtree generated by the algorithm of Lemma 2.5.5 (although it is
unclear how to implement the algorithm described here efficiently).
Now, all the nodes of T that were created when adding the extended cluster nodes can be charged to
nodes of T 0 . Therefore we need only count the total number of crowded cells in T 0 .
Linearly many crowded cells have more than one child with points in them. It can happen at most linearly many times that a non-empty cell c, of side length κ, has a point of P outside it within distance 2κ which, at the next level, lies in a cell non-adjacent to the children of c; indeed, such a point becomes relatively farther away as the cells shrink when they split.
If a cell b containing a point is split because an extended neighbor was split, but no extended neighbor
contains any point, then, when either b or b’s parent was split, a nearby point became farther away than 2κ.
Again, this can only happen linearly many times.
Finally a cell may contain two points, or several extended neighbor cells may contain points, and this
situation may persist when the cells split. If splitting the children of the cell or of its neighbors separates the
points, we can charge linear total work. Otherwise, let Y be a maximal set of points in the union of cell b
and its neighbors, such that splitting b, its neighbors, or the children of b and its neighbors does not further
divide Y . Then some triangle of DT (P) connects two points y1 and y2 in Y with a point z outside Y.
Each split not yet accounted for occurs between the step when Y is separated from z, and the step when
y1 and y2 become more than 2κ units apart. These steps are at most O(log Eratio(△y1y2z)) quadtree levels apart, so we can charge all the crowded cells caused by Y to △y1y2z. This triangle will not be charged by any other cells, because once we perform the splits charged to it all three points become far away from each other in the quadtree.
Therefore the number of crowded cells can be counted as a linear term, plus terms of the form O(log Eratio(△abc)) for some Delaunay triangles △abc.
Theorem 2.5.8 Given any point set P, we can find a triangulation QT(P) such that each point of P is a vertex of QT(P) and Aratio(QT(P)) ≤ 4. There is a constant c″, independent of P, such that if M is any triangulation containing the points of P as vertices, then |QT(P)| ≤ c″ |M| log Aratio(M).
In particular, any triangulation with constant aspect ratio containing P is of size Ω(|QT(P)|). Thus, up to a constant, QT(P) is an optimal triangulation.
Proof: Let Y be the set of vertices of M. Lemma 2.5.7 states that there is a constant c such that |QT(Y)| ≤ c Σ_{△∈DT(Y)} log Eratio(△). The Delaunay triangulation has the property that it maximizes the minimum angle of the triangulation, among all triangulations of the point set [For97].
If Y = P, then using this max-min-angle property, we have Aratio(M) ≥ (1/2)·Aratio(DT(P)) ≥ (1/2)·Eratio(DT(P)), by Lemma 2.5.1. Hence
|QT(P)| ≤ c Σ_{△∈DT(P)} log Eratio(△) ≤ c |M| log Eratio(DT(P)) ≤ 2c |M| log Aratio(M).
Otherwise, P ⊂ Y. Imagine running our algorithm on point set Y, and observe that |QT (P)| ≤ |QT (Y)|.
By the same argument as above, |QT (Y)| ≤ c |M| log Aratio (M).
2.6 Bibliographical notes
The idea of storing a quadtree in an ordered set by using the Q-order on the nodes (or even only on the leaves) is due to Gargantini [Gar82], and it is referred to as linear quadtrees in the literature. The idea has been used repeatedly for getting good performance in practice from quadtrees.
It may be beneficial to emphasize that if one does not require the internal nodes of the compressed quadtree for the application, then one can avoid storing them in the data-structure. In fact, if one is only interested in the points themselves, then one can even skip storing the leaves, and the compressed quadtree just becomes a data-structure that stores the points according to their Z-order. This approach can be used, for example, to construct a data-structure for approximate nearest neighbor [Cha02] (however, this data-structure is still inferior, in practice, to the more optimized but more complicated data-structure of Arya et al. [AMN+ 98]). The author finds thinking about such data-structures as compressed quadtrees (with the whole additional unnecessary information) more intuitive, but the reader might disagree.
Z-order and space filling curves. The idea of using the Z-order for speeding up spatial data-structures can be traced back to the above work of Gargantini [Gar82]; it is widely used in databases and seems to improve performance in practice [KF93]. The Z-order can be viewed as a mapping from the unit interval to the unit square, by taking the odd bits of a real number α ∈ [0, 1) to be the x-coordinate and the even bits of α to encode the y-coordinate of the mapped point. While this mapping is simple to define, it is not continuous. Somewhat surprisingly, one can find a continuous mapping that maps the unit interval onto the unit square; see Exercise 2.7.4. A large family of such mappings is known by now; see Sagan [Sag94] for an accessible book on the topic.
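To make the bit-splitting description above concrete, here is a small sketch (not part of the original text) that maps a finite-precision α ∈ [0, 1) to a point of the unit square by sending its odd-indexed bits to the x-coordinate and its even-indexed bits to the y-coordinate. The function name and the number of bits used are assumptions of the sketch.

```python
def z_order_point(alpha, bits=32):
    """Map alpha in [0,1) to (x, y) in [0,1)^2 by de-interleaving its binary
    expansion: odd-positioned bits go to x, even-positioned bits go to y."""
    assert 0.0 <= alpha < 1.0
    x = y = 0.0
    wx = wy = 0.5                 # weight of the next bit of x / of y
    for i in range(bits):
        alpha *= 2                # shift the next bit of alpha into the integer part
        bit = int(alpha)
        alpha -= bit
        if i % 2 == 0:            # 1st, 3rd, 5th, ... bit goes to the x-coordinate
            x += bit * wx
            wx /= 2
        else:                     # 2nd, 4th, 6th, ... bit goes to the y-coordinate
            y += bit * wy
            wy /= 2
    return (x, y)

# Example: alpha = 0.11011 in binary maps to x = 0.101 = 0.625, y = 0.11 = 0.75.
print(z_order_point(0.84375))
```

The inverse direction (interleaving the bits of the coordinates back into a single number) is what allows points to be sorted by their Z-order, which is the operation the text exploits.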
But is it really practical? Quadtrees seem to be widely used in practice and perform quite well. Compressed quadtrees seem to be less widely used, but they have the benefit of being much simpler than their relatives, which seem to be more practical but are theoretically equivalent.
Good triangulations. Balanced quadtrees and good triangulations are due to Bern et al. [BEG94], and our presentation closely follows theirs. The problem of generating good triangulations has received considerable attention recently, as it is central to the problem of generating good meshes, which in turn are important for efficient numerical simulations of physical processes. The main technique used in generating good triangulations is the method of Delaunay refinement. Here, one computes the Delaunay triangulation of the point set, and inserts the circumcenters of “bad” triangles as new points. Proving that this method converges and generates optimal triangulations is a non-trivial undertaking, and is due to Ruppert [Rup93]. Extending it to higher dimensions, and handling boundary conditions, make it even more challenging. However, in practice, the Delaunay refinement method outperforms the (more elegant and simpler to analyze) method of Bern et al. [BEG94], which easily extends to higher dimensions. Namely, the Delaunay refinement method generates good meshes with fewer triangles.
On the other hand, Delaunay refinement methods are slower in theory. Getting an algorithm to perform Delaunay refinement in the same time as the algorithm of Bern et al. is still open, although Miller [Mil04] gave an algorithm with only a slightly slower running time.
Very recently, Alper Üngör came up with a “Delaunay-refinement type” algorithm which outputs better meshes than the classical Delaunay refinement algorithm [Üng04]. Furthermore, by merging the quadtree approach with Üngör's technique, one can get an algorithm with optimal running time [HÜ05].
The author reserves the right to disagree with himself on this topic in the future if the need arises.
2.7 Exercises
Exercise 2.7.1 (Quadtree for fat triangulations.) [5 Points]
A triangle △ is called α-fat if each one of its angles is at least α, where α > 0 is a prespecified constant (for example, α is 5 degrees). Let P be a triangular planar map of the unit square (i.e., each face is a triangle), where all the triangles are fat, and the total number of triangles is n. Prove that the complexity of the quadtree constructed for P is O(n).
(A) [2 Points] Prove that the mapping σ covers all the points in the open square [0, 1)², and that it is one to one.
Acknowledgments
The author wishes to thank John Fischer for his detailed comments on the manuscript.
Chapter 3
As a concrete example, consider the three points p, q, and r depicted in Figure 3.1. We would like to have a representation that captures that p has similar distance to q and r, and furthermore, that q and r are close together as far as p is concerned. As such, if we are interested in the closest pair among the three points, we will only check the distance between q and r, since they are the only pair (among the three) that might realize the closest pair.
Denote by A ⊗ B = { {x, y} | x ∈ A, y ∈ B } all the (unordered) pairs of points formed by the sets A and B. A pair of sets of points Q and R is (1/ε)-separated if
max(diam(Q), diam(R)) ≤ ε · d(Q, R),
where d(Q, R) = min_{q∈Q, r∈R} ‖q − r‖. Intuitively, the pair Q ⊗ R is (1/ε)-separated if all the points of Q have roughly the same distance to the points of R. Alternatively, imagine covering the two point sets with two balls of minimum size; we require that the distance between the two balls is at least 2/ε times the radius of the larger of the two.
Thus, for the three points of Figure 3.1, the pairs {p} ⊗ {q, r} and {q} ⊗ {r} are (say) 2-separated and describe all the distances among these three points. (The gain here is quite marginal, as we replaced the distance description, made out of three pairs of points, by the distances between two pairs of sets. But stay tuned – exciting things are about to unfold.)
Motivated by the above example, a well-separated pair decomposition is a way to describe a metric by
such “well separated” pairs of sets.
[Figure 3.2: (i) A point set P = {a, b, c, d, e, f}. (ii) Its decomposition into pairs: A1 = {d}, B1 = {e}; A2 = {a, b, c}, B2 = {e}; A3 = {a, b, c}, B3 = {d}; A4 = {a}, B4 = {b, c}; A5 = {b}, B5 = {c}; A6 = {a}, B6 = {f}; A7 = {b}, B7 = {f}; A8 = {c}, B8 = {f}; A9 = {d}, B9 = {f}; A10 = {e}, B10 = {f}. (iii) Its respective (1/2)-WSPD W = { {A1, B1}, . . . , {A10, B10} }. For example, the pair of points b and e (and their distance) is represented by {A2, B2}, as b ∈ A2 and e ∈ B2. (iv) The quadtree T representing the point set P. (v) The WSPD as defined by pairs of vertices of T.]
Definition 3.1.1 (WSPD) A well-separated pair decomposition (WSPD) with parameter 1/ε of P is a set of pairs
W = { {A1, B1}, . . . , {As, Bs} },
such that
(A) Ai, Bi ⊂ P for every i,
(B) Ai ∩ Bi = ∅ for every i,
(C) the sets Ai and Bi are (1/ε)-separated for every i, and
(D) ⋃_{i=1}^{s} Ai ⊗ Bi = P ⊗ P, where the sets Ai ⊗ Bi are pairwise disjoint.
Translation: For any pair of points p, q ∈ P, there is exactly one pair {Ai, Bi} ∈ W such that p ∈ Ai and q ∈ Bi.
For a concrete example of a WSPD, see Figure 3.2.
Instead of maintaining such a decomposition explicitly, it is convenient to construct a tree T having the points of P as leaves, so that every pair {Ai, Bi} is just a pair of nodes (vi, ui) of T with Ai = Pvi and Bi = Pui, where Pv denotes the points of P stored in the subtree of a node v of T. Naturally, in our case, the tree we would use is a compressed quadtree of P, but any tree that decomposes the points such that the diameter of a point set stored in a node drops quickly as we go down the tree might work.
This WSPD representation using a tree gives us a compact representation of the distances of the point
set.
Corollary 3.1.2 For an ε⁻¹-WSPD W it holds, for any pair {u, v} ∈ W, that
∀q ∈ Pu, r ∈ Pv:  max(diam(Pu), diam(Pv)) ≤ ε ‖q − r‖.
It would usually be convenient to associate with each set Pu in the WSPD, an arbitrary representative
point repu ∈ P. Selecting and assigning these representative points can always be done by a simple DFS
traversal of the T used to represent the WSPD.
We compute the compressed quadtree T of P in O(n log n) time. Next, we compute the WSPD by calling
AlgWSPD(u0 , u0 ), where u0 is the root of T and AlgWSPD is depicted in Figure 3.3.
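Figure 3.3 is not reproduced in this text. The following Python sketch captures the recursion that the surrounding proofs describe: emit a pair once its two cells are well separated, and otherwise refine the node with the larger cell. The node attributes `diam` (the cell diameter ∆(v), taken to be 0 for a leaf containing a single point) and `children`, and the cell-distance function `dist(u, v)`, are assumed interfaces, not definitions from the text.

```python
def alg_wspd(u, v, eps, dist, pairs):
    """Sketch of the WSPD recursion on a (compressed) quadtree.
    u.diam / v.diam: cell diameter (assumed 0 for a single-point leaf).
    dist(u, v): distance between the two cells (assumed helper)."""
    if u.diam < v.diam:
        u, v = v, u                       # keep the node with the larger cell first
    if u is v and not u.children:
        return                            # a single leaf: nothing left to separate
    if u is not v and max(u.diam, v.diam) <= (eps / 8.0) * dist(u, v):
        pairs.append((u, v))              # P_u and P_v are (1/eps)-separated
        return
    for child in u.children:              # always refine the larger of the two cells
        alg_wspd(child, v, eps, dist, pairs)

# Usage sketch:  pairs = []; alg_wspd(root, root, eps, dist, pairs)
```

Note that this naive sketch may generate a well-separated pair twice (once in each order); treating the output pairs as unordered and removing duplicates only changes the count by a constant factor.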
The following lemma is implied by an easy packing argument.
Lemma 3.1.3 Let □ be a cell of a grid G of IR^d with cell diameter x. For y ≥ x, the number of cells in G at distance at most y from □ is O((y/x)^d).®
Lemma 3.1.4 The WSPD generated by AlgWSPD is valid. Namely, for any pair {u, v} in the WSPD, we have
max(diam(Pu), diam(Pv)) ≤ ε · d(u, v)   and   d(u, v) ≤ ‖q − r‖,
for any q ∈ Pu and r ∈ Pv.
®
The O(·) notation here (and the rest of the chapter) hides a constant that depends on d.
Proof: For every output pair {u, v}, we have
max{diam(Pu), diam(Pv)} ≤ max{∆(u), ∆(v)} ≤ (ε/8)·d(u, v) ≤ ε · d(u, v).
Also, for any q ∈ Pu and r ∈ Pv, we have d(u, v) ≤ ‖q − r‖, since Pu ⊆ □u and Pv ⊆ □v.
Finally, by induction, it follows that every pair of points of P is covered by a pair of subsets {Pu, Pv} output by the AlgWSPD algorithm. Note that AlgWSPD always stops if both u and v are leaves, which implies that AlgWSPD always terminates.
Proof: We trivially have that ∆(u) < ∆(p(u)) and ∆(v) < ∆(p(v)).
The pair {u, v} was generated because of a sequence of recursive calls AlgWSPD(u0, u0), AlgWSPD(u1, v1), . . ., AlgWSPD(us, vs), where us = u, vs = v, and u0 is the root of T. Assume that us−1 = u and vs−1 = p(v). Then ∆(u) ≤ ∆(p(v)), since the algorithm always refines the larger cell (i.e., vs−1 = p(v) in the pair {us−1, vs−1}).
Similarly, let t be the last index such that ut−1 = p(u) (namely, ut = u and vt−1 = vt). Then, since v is a descendant of vt−1, it holds that
∆(v) ≤ ∆(vt) = ∆(vt−1) ≤ ∆(ut−1) = ∆(p(u)),
by Lemma 3.1.5.
We charge the pair {u, v} to the node v′, and claim that each node of T is charged at most O(ε⁻ᵈ) times. To this end, fix a node v′ ∈ V(T), where V(T) is the set of vertices of T. Since the pair {u, v′} was not output by AlgWSPD (despite being considered), we conclude that 8∆(v′) > ε · d(u, v′), and as such d(u, v′) < r = 8∆(v′)/ε. Now, there are several possibilities:
(i) ∆(v′) = ∆(u). There are at most O((r/∆(v′))^d) = O(1/ε^d) nodes that have the same level (i.e., diameter) as v′ and whose cells are within distance r from it, by Lemma 3.1.3. Thus, this type of charge can happen at most O(2^d · (1/ε^d)) times, since v′ has at most 2^d children.
(ii) ∆(p(u)) = ∆(v′). By the same argumentation as above, d(p(u), v′) ≤ d(u, v′) < r. There are at most O(1/ε^d) such nodes p(u). Since the node p(u) has at most 2^d children, it follows that the number of such charges is at most O(2^d · 2^d · (1/ε^d)).
(iii) ∆(p(u)) > ∆(v′) > ∆(u). Consider the canonical grid G having □v′ as one of its cells (see Definition 2.2.4). Let □̂ be the cell of G containing □u. Observe that □u ⊊ □̂ ⊆ □p(u). In addition, d(□̂, □v′) ≤ d(□u, □v′) = d(u, v′) < r. It follows that there are at most O(1/ε^d) cells like □̂ that might participate in charging v′, and as such the total number of charges is O(2^d/ε^d), as claimed.
As such, v′ can be charged at most O(2^{2d}/ε^d) = O(1/ε^d) times. This implies that the total number of pairs generated by the algorithm is O(nε⁻ᵈ), since the number of nodes in T is O(n).
Since the running time of AlgWSPD is clearly linear in the output size, we have the following result.
Theorem 3.2.1 Given an n-point set P ⊆ IR^d and a parameter 1 ≥ ε > 0, one can compute a (1 + ε)-spanner of P with O(nε⁻ᵈ) edges, in O(n log n + nε⁻ᵈ) time.
Proof: Let c ≥ 16 be an arbitrary constant, and set δ = ε/c. Compute a δ⁻¹-WSPD decomposition using the algorithm of Theorem 3.1.7. For any vertex u in the quadtree T (used in computing the WSPD), let repu be an arbitrary point of Pu. For every pair {u, v} ∈ W, add an edge between repu and repv with weight ‖repu − repv‖, and let G be the resulting graph. Observe that, by the triangle inequality, we have dG(q, r) ≥ ‖q − r‖ for any q, r ∈ P.
The upper bound on the stretch is proved by induction on the length of pairs in the WSPD. So, fix a pair x, y ∈ P, and assume, by the induction hypothesis, that for any pair z, w ∈ P such that ‖z − w‖ < ‖x − y‖ it holds that dG(z, w) ≤ (1 + ε) ‖z − w‖.
The pair x, y must appear in some pair {u, v} ∈ W, where x ∈ Pu and y ∈ Pv. Thus
‖repu − repv‖ ≤ d(u, v) + ∆(u) + ∆(v) ≤ (1 + 2δ) ‖x − y‖
and
max( ‖repu − x‖, ‖repv − y‖ ) ≤ max(∆(u), ∆(v)) ≤ δ · d(u, v) ≤ δ ‖repu − repv‖ ≤ δ(1 + 2δ) ‖x − y‖ < (1/4) ‖x − y‖,
by Theorem 3.1.7 and since δ ≤ 1/16. As such, we can apply the induction hypothesis to the pairs repu, x and repv, y, implying that
dG(x, repu) ≤ (1 + ε) ‖repu − x‖   and   dG(repv, y) ≤ (1 + ε) ‖y − repv‖.
Now, since repu repv is an edge of G, it holds that dG(repu, repv) ≤ ‖repu − repv‖. Thus, by the inductive hypothesis and the triangle inequality, we have that
dG(x, y) ≤ dG(x, repu) + dG(repu, repv) + dG(repv, y) ≤ (1 + ε)‖repu − x‖ + ‖repu − repv‖ + (1 + ε)‖repv − y‖ ≤ 2(1 + ε)δ(1 + 2δ)‖x − y‖ + (1 + 2δ)‖x − y‖ ≤ (1 + ε)‖x − y‖.
The last step follows by an easy calculation. Indeed, since cδ = ε ≤ 1 and 16δ ≤ 1 and c ≥ 11, we have that
2(1 + ε)δ(1 + 2δ) + (1 + 2δ) ≤ 4·(9/8)·δ + 1 + 2δ ≤ 1 + 7δ ≤ 1 + cδ = 1 + ε,
as required.
Lemma 3.2.2 Given a set P of n points in IR^d, one can compute a spanning tree T of P such that w(T) ≤ (1 + ε)·w(M), where M is the minimum spanning tree of P and w(T) is the total weight of the edges of T. This takes O(n log n + nε⁻ᵈ) time.
In fact, for any r ≥ 0 and a connected component C of M≤r, the set C is contained in a connected component of T≤(1+ε)r.
Proof: Compute a (1 + ε)-spanner G of P, and let T be the minimum spanning tree of G. Clearly, T is the required (1 + ε)-approximate MST. Indeed, for any q, r ∈ P, let π_qr denote the shortest path between q and r in G. Since G is a (1 + ε)-spanner, we have that w(π_qr) ≤ (1 + ε) ‖q − r‖, where w(π_qr) denotes the weight of π_qr in G.
We have that G′ = (P, E) is a connected subgraph of G, where
E = ⋃_{(q,r)∈M} π_qr,
since G is a (1 + ε)-spanner. It thus follows that w(M(G)) ≤ w(G′) ≤ (1 + ε)·w(M(P)), where M(G) is the minimum spanning tree of G.
The second claim follows by similar argumentation.
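The reduction above is short enough to state as code. The sketch below (not from the text) runs Kruskal's algorithm over the spanner edges; the helper `spanner_edges(points, eps)`, assumed to return weighted edges such as the rep-rep edges of the WSPD construction, is a hypothetical name introduced here.

```python
def approx_mst(points, eps, spanner_edges):
    """Kruskal's algorithm on a (1+eps)-spanner of the points.
    spanner_edges(points, eps) is assumed to return triples (w, i, j),
    where w is the length of the spanner edge between points i and j."""
    parent = list(range(len(points)))

    def find(i):                              # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for w, i, j in sorted(spanner_edges(points, eps)):
        ri, rj = find(i), find(j)
        if ri != rj:                          # edge connects two components
            parent[ri] = rj
            tree.append((w, i, j))
    return tree                               # edges of a (1+eps)-approximate MST
```

Since the spanner has O(nε⁻ᵈ) edges, the sorting step dominates, matching the stated running time up to the spanner construction.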
3.2.3 Approximating the Diameter
Lemma 3.2.3 Given a set P of n points in IR^d, one can compute, in O(n log n + nε⁻ᵈ) time, a pair u, v ∈ P such that ‖u − v‖ ≥ (1 − ε) diam(P).
Proof: Compute a (4/ε)-WSPD of P. As before, we assign to each node u of T an arbitrary representative point repu ∈ Pu. Then, for every pair in the WSPD, compute the distance between its two representative points, and return the pair of representatives farthest away from each other.
To see why this works, consider the pair q, r ∈ P realizing the diameter of P, and let {u, v} ∈ W be the pair in the WSPD that contains the two points (i.e., q ∈ Pu and r ∈ Pv). We have that
‖repu − repv‖ ≥ d(u, v) ≥ ‖q − r‖ − diam(Pu) − diam(Pv) ≥ (1 − 2(ε/2)) ‖q − r‖ = (1 − ε) diam(P),
since, by Corollary 3.1.2, max(diam(Pu), diam(Pv)) ≤ 2(ε/4) ‖q − r‖. Namely, the distance between the two points output by the algorithm is at least (1 − ε) diam(P).
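A minimal sketch of this procedure, assuming the WSPD is already available as a list of representative-point pairs (one pair of coordinate tuples per WSPD pair); the function name is introduced here for illustration only.

```python
from math import dist  # Euclidean distance between coordinate tuples (Python 3.8+)

def approx_diameter(rep_pairs):
    """Given one (rep_u, rep_v) pair of points per WSPD pair, return the
    farthest such pair; for a (4/eps)-WSPD its distance is at least
    (1 - eps) * diam(P), by the argument above."""
    return max(rep_pairs, key=lambda pr: dist(pr[0], pr[1]))
```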
Analysis. Consider the pair of closest points p and q in P, and consider the pair {u, v} ∈ W such that p ∈ Pu and q ∈ Pv. If Pu contains an additional point r ∈ Pu, then we have that
‖p − r‖ ≤ diam(Pu) ≤ ε · d(u, v) ≤ ε ‖p − q‖ < ‖p − q‖,
by Theorem 3.1.7 and since ε = 1/2. Thus, ‖p − r‖ < ‖p − q‖, a contradiction to the choice of p and q as the closest pair. Thus, |Pu| = |Pv| = 1, repu = p, and repv = q. This implies that the algorithm indeed returns the closest pair.
Theorem 3.2.4 Given a set P of n points in IRd , one can compute the closest pair of points of P in O(n log n)
time.
We remind the reader that we already saw a linear (expected) time algorithm for this problem in Sec-
tion 1.2. However, this is a deterministic algorithm, and it can be applied in more abstract settings where a
small WSPD still exists, while the previous algorithm would not work.
3.2.5 All Nearest Neighbors
Given a set P of n points in IR^d, we would like to compute for each point q ∈ P its nearest neighbor in P (formally, this is the closest point in P \ {q} to q). This is harder than it might seem at first, since this is not a symmetric relationship. Indeed, q might be the nearest neighbor to p, but r might be the nearest neighbor to q.
3.2.5.1 The bounded spread case
Assume P is contained in the unit square, and diam(P) ≥ 1/4. Furthermore, let Φ = Φ(P) denote the spread of P. Compute an ε⁻¹-WSPD W of P, for ε = 1/4. Arguing as in the closest pair case, we have that if the nearest neighbor of p is q, then there exists a pair {u, v} ∈ W such that Pu = {p} and q ∈ Pv. Thus, scan all the pairs {u, v} with a singleton as one of their sides (i.e., |Pu| = 1), and for each such singleton Pu = {r}, record for r the closest point to it in Pv. Maintain for each point the closest point to it that was encountered.
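The scan just described is straightforward; the following sketch (not from the text) assumes each WSPD pair is given as the two point sets (Pu, Pv), with points represented as coordinate tuples. Materializing the sets like this is wasteful in general, but it keeps the illustration short.

```python
from math import dist

def all_nearest_neighbors(pairs):
    """For every WSPD pair with a singleton side {r}, scan the other side and
    keep the closest point seen so far for r.
    pairs: iterable of (P_u, P_v) tuples, each a list of coordinate tuples."""
    best = {}                                    # point -> (distance, neighbor)
    for P_u, P_v in pairs:
        for singleton, other in ((P_u, P_v), (P_v, P_u)):
            if len(singleton) != 1:
                continue
            r = singleton[0]
            q = min(other, key=lambda p: dist(r, p))   # closest point of the other side
            d = dist(r, q)
            if r not in best or d < best[r][0]:
                best[r] = (d, q)
            break                                # at most one side can be the singleton we handle here
    return best
```

The analysis below bounds how many times each point is scanned in this process, which is where the log Φ factor comes from.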
Analysis. The analysis of this algorithm is slightly tedious, but it reveals some additional interesting prop-
erties of WSPD.
A pair of nodes {x, y} of T is a generator of a pair {u, v} if {u, v} was computed inside a recursive call
AlgWSPD(x, y).
Lemma 3.2.5 Let W be an ε⁻¹-WSPD of a point set P generated by AlgWSPD. Consider a pair {u, v} ∈ W. Then ∆(p(v)) ≥ (ε/2)·d(u, v) and ∆(p(u)) ≥ (ε/2)·d(u, v).
Proof: Assume, for the sake of contradiction, that ∆(v′) < (ε/2)·ℓ, where ℓ = d(u, v) and v′ = p(v). By Lemma 3.1.5, we have that
∆(u) ≤ ∆(v′) < ε·ℓ/2.
But then
d(u, v′) ≥ ℓ − ∆(v′) ≥ ℓ − ε·ℓ/2 ≥ ℓ/2.
Thus,
max( ∆(u), ∆(v′) ) < ε·ℓ/2 ≤ ε·d(u, v′).
Namely, u and v′ are well-separated, and as such {u, v′} cannot be a generator of {u, v}. Indeed, if {u, v′} was considered by the algorithm then it would have added it to the WSPD, and never created {u, v}.
So, the other possibility is that {u′, v} is the generator of {u, v}, where u′ = p(u). But then ∆(u′) ≤ ∆(v′) < ε·ℓ/2, by Lemma 3.1.5. Using the same argumentation as above, we have that {u′, v} is a well-separated pair, and as such it cannot be a generator of {u, v}.
But this implies that {u, v} cannot be generated by AlgWSPD, since either {u, v′} or {u′, v} must be a generator of {u, v}. A contradiction.
Claim 3.2.6 For two pairs {u, v}, {u′, v′} ∈ W such that □u ⊆ □u′, it holds that the interiors of □v and □v′ are disjoint.
Proof: If u′ is an ancestor of u and v′ is an ancestor of v, then AlgWSPD returned the pair {u′, v′} and would never have generated the pair {u, v}.
If u′ is an ancestor of u and v is an ancestor of v′, then
∆(u) < ∆(u′) ≤ ∆(p(v′)) ≤ ∆(v) ≤ ∆(p(u)) ≤ ∆(u′),
by Lemma 3.1.5 applied to {u′, v′} and {u, v}. Namely, ∆(v) = ∆(u′). But then the pair {u′, v} is a generator of both {u, v} and {u′, v′}. But it is impossible that AlgWSPD generated both pairs when processing {u′, v}, as can be easily verified.
A similar analysis applies for the case that u = u′.
Lemma 3.2.7 Let P be a set of n points in IR^d, let W be an ε⁻¹-WSPD of P, and let ℓ > 0 be a distance. Let Ŵ ⊆ W be the set of pairs {u, v} ∈ W such that ℓ ≤ d(u, v) ≤ 2ℓ. Then, for any point p ∈ P, the number of pairs in Ŵ containing p is O(1/ε^d).
Proof: Let u be the leaf of the quadtree T (that is used in computing W) storing the point p, and let π be the path between u and the root of T. We claim that Ŵ contains at most O(1/ε^d) pairs with nodes that appear along π. Let
T = { v | u ∈ π, {u, v} ∈ Ŵ }.
The cells of T are interior disjoint by Claim 3.2.6, and they account for all the pairs in Ŵ that cover p.
So, let r be the largest power of two which is smaller than (say) εℓ/(4√d). Clearly, there are O(1/ε^d) cells of Gr within distance 2ℓ of □u. We account for the nodes v ∈ T as follows:
(i) If ∆(v) ≥ r√d then □v contains a cell of Gr, and there are at most O(1/ε^d) such cells.
(ii) If ∆(v) < r√d and ∆(p(v)) ≥ r√d, then:
(a) If p(v) is a compressed node, then □p(v) contains a cell of Gr and it has only v as a single child. As such, there are at most O(1/ε^d) such charges.
(b) Otherwise, p(v) is not compressed, but then diam(□v) = diam(□p(v))/2. As such, □v contains a cell of Gr/2 within distance 2ℓ of □u, and there are O(1/ε^d) such cells.
(iii) The case ∆(p(v)) < r√d is impossible. Indeed, by Lemma 3.1.5, we would have ∆(p(v)) < r√d ≤ εℓ/4 ≤ (ε/4)·d(u, v), a contradiction to Lemma 3.2.5.
Lemma 3.2.8 Let P be a set of n points in the plane. One can solve the all nearest neighbor problem for P in O(n(log n + log Φ(P))) time, where Φ(P) is the spread of P.
Proof: The algorithm is described above; we are left with the task of analyzing its running time. For i ∈ {0, −1, . . . , −lg Φ − 4}, consider the set of pairs Wi such that {u, v} ∈ Wi if and only if {u, v} ∈ W and 2^{i−1} ≤ d(u, v) ≤ 2^i. A point p ∈ P can be scanned at most O(1/ε^d) = O(1) times because of pairs in Wi, by Lemma 3.2.7. As such, a point gets scanned at most O(log Φ) times overall, which implies the running time bound.
To handle the unbounded case, we need to use some additional geometric properties.
Lemma 3.2.9 Let u be a node in the compressed QT of P, and partition the space around repu into cones of
angle ≤ π/12. Let ψ be such a cone, and let Q be the set of all points in P which are in distance ≥ 4 diam(Pu )
from repu , and they all lie inside ψ. Let q be the closest point in Q to repu . Then, q is the only point in Q that
its nearest neighbor might be in Pu .
It is now straightforward (but tedious) to show that, in fact, for any p ∈ Pu, we have ‖r − p‖ ≥ ‖r − q‖, which implies the claim.
Lemma 3.2.9 implies that we can do a top-down traversal of the compressed quadtree of P, after computing an ε⁻¹-WSPD W of P, for ε = 1/16. For every node u, we maintain a (constant size) set Ru of candidate points whose nearest neighbor might lie in Pu.
So, assume we have already computed Rp(u), and consider the set
X(u) = Rp(u) ∪ ⋃_{{u,v}∈W, |Pv|=1} Pv.
(Note that we do not have to consider pairs with |Pv| > 1, since no point in Pv can have its nearest neighbor in Pu in such a scenario.) Clearly, we can compute X(u) in time linear in the number of pairs of W involving u. Now, we build a “grid” of cones around repu, and throw the points of X(u) into this grid. For each such cone, we keep only the closest point to repu. Let Ru be the set of these closest points. Since the number of cones is O(1), it follows that |Ru| = O(1).
Now, if Pu contains only a single point p, then we compute, for every point q ∈ Ru, its distance to p, and if p is a better candidate to be the nearest neighbor of q, then we set p as the (current) nearest neighbor of q.
Clearly, the resulting running time (ignoring the computation of the WSPD) is linear in the number of pairs of the WSPD and the size of the compressed quadtree. Correctness follows since, if p is the nearest neighbor of q, then there must be a WSPD pair {u, v} such that Pv = {q} and p ∈ Pu. But then the algorithm adds q to the set Ru, and q remains in Rz for all descendants z of u in the quadtree such that p ∈ Pz. In particular, if y is the leaf of the quadtree storing p, then q ∈ Ry, which implies that the algorithm correctly computes the nearest neighbor of q.
Theorem 3.2.10 Given a set P of n points in IRd , one can solve the all nearest neighbor problem in
O(n log n) time.
3.3 Bibliographical notes
Diameter. The algorithm of Section 3.2.3 for approximating the diameter can be improved by not constructing pairs that cannot improve the (current) diameter, and by constructing the underlying tree on the fly together with the diameter. This yields a simple algorithm that works quite well in practice; see [Har01a].
All nearest neighbors. Section 3.2.5 is a simplification of the solution to the all k-nearest neighbors problem. Here, one can compute for every point its k nearest neighbors in O(n log n + nk) time. See [CK95] for details.
The all nearest neighbor algorithm for the bounded spread case (Section 3.2.5.1) is from [HM06]. Note that, unlike the unbounded case, this algorithm only uses packing arguments for its correctness. Surprisingly, the usage of the Euclidean nature of the underlying space (as done in Section 3.2.5.2) seems to be crucial for getting a faster algorithm for this problem. In particular, for the case of metric spaces of low doubling dimension (that do have a small WSPD), solving this problem requires Ω(n²) time in the worst case.
Dynamic maintenance. A WSPD can be maintained in polylogarithmic time under insertions and deletions. This is quite surprising when one considers that, in the worst case, a point might participate in a linear number of pairs; in fact, a node in the quadtree might participate in a linear number of pairs. This is described in detail in Callahan's thesis [Cal95]. Interestingly, using randomization, maintaining the WSPD can be considerably simplified; see the work by Fischer and Har-Peled [FH05].
High dimension. In high dimensions, as the uniform metric demonstrates (i.e., n points, all of them at distance 1 from each other), the WSPD can have quadratic complexity. This metric is easily realizable as the vertices of a simplex in IR^{n−1}. On the other hand, doubling metrics have near linear size WSPDs. Since WSPDs by themselves are so powerful, it is tempting to try to define the dimension of a point set by the size of the WSPD it possesses. This seems like an interesting direction for future research, as currently little is known about it (to the best of my knowledge).
3.4 Exercises
Exercise 3.4.1 (WSPD Structure.) [5 Points]
(A) Let ε > 0 be a sufficiently small constant. For any sufficiently large n, show an example of a point set P of n points, such that its (1/ε)-WSPD (as computed by AlgWSPD) has the property that a single set participates in Ω(n) pairs.®
(B) Show, that if we list explicitly the sets forming the WSPD (even if we show each set exactly once) then
the total size of such a description is quadratic. (Namely, the implicit representation we use is crucial
to achieve efficient representation.)
Let P be a set of n points in IR^d. The sponginess¯ of P is the quantity X = Σ_{{p,q}⊆P} ‖p − q‖. Provide an efficient algorithm for approximating X. Namely, given P and a parameter ε > 0, it outputs a number Y such that X ≤ Y ≤ (1 + ε)X.
(The interested reader can also verify that computing (exactly) the sum of all squared distances (i.e., Σ_{{p,q}⊆P} ‖p − q‖²) is considerably easier.)
¯
Also known as the sum of pairwise distances in the literature, for reasons that I can not fathom.
Chapter 4
Do not read this story; turn the page quickly. The story may upset you. Anyhow, you probably know it already. It
is a very disturbing story. Everyone knows it. The glory and the crime of Commander Suzdal have been told in a
thousand different ways. Don’t let yourself realize that the story is the truth.
It isn’t. not at all. There’s not a bit of truth to it. There is no such planet as Arachosia, no such people as klopts, no
such world as Catland. These are all just imaginary, they didn’t happen, forget about it, go away and read something
else.
– The Crime and Glory of Commander Suzdal, Cordwainer Smith
In this chapter, we will initiate our discussion of clustering. Clustering is one of the most fundamental computational tasks but, frustratingly, one of the fuzziest. It can be stated informally as: “Given data, find interesting structure in the data. Go!”
The fuzziness arises naturally from the requirement that the structure be “interesting”, as this is not well defined and depends on human perception, which is sometimes impossible to quantify clearly. Similarly, what is “structure” is also open to debate. Nevertheless, clustering is inherent to many computational tasks like learning, searching, and data-mining.
Empirical study of clustering concentrates on trying various measures for the clustering, and trying out various algorithms and heuristics to compute these clusterings. See the bibliographical notes for some relevant references.
Here, we will concentrate on some well defined clustering tasks, including k-center clustering, k-median clustering, and k-means clustering, and some basic algorithms for these problems.
4.1 Preliminaries
A clustering problem is usually defined by a set of items, and a distance function defined between these
items. While these items might be points in IRd and the distance function is just the regular Euclidean
distance, it is sometime beneficial to consider the more abstract setting of a general metric space.
Definition 4.1.1 A metric space is a pair (X, d) where X is a set and d : X × X → [0, ∞) is a metric,
satisfying the following axioms: (i) d(x, y) = 0 if and only if x = y, (ii) d(x, y) = d(y, x), and (iii) d(x, y) +
d(y, z) ≥ d(x, z) (triangle inequality).
For example, IR2 with the regular Euclidean distance is a metric space. In the following, we assume that
we are given a black-box access to d. Namely, given two points p, q ∈ X, we assume that d(p, q) can be
computed in constant time.
The input is a set of n points P ⊆ X. Given a set of centers C, every point of P is assigned to its nearest neighbor in C. All the points of P that are assigned to a center c form the cluster of c, denoted by
Π(C, c) = { p ∈ P | d(p, c) ≤ d(p, C) },
where d(p, C) = min_{c′∈C} d(p, c′) denotes the distance of p to the set C. Namely, the center set C partitions P into clusters. This specific scheme of partitioning points by assigning them to their closest center (in a given set of centers) is known as a Voronoi partition. The k-center clustering price of clustering P by C is
ν∞^C(P) = max_{p∈P} d(p, C);
note that every point in a cluster is in distance at most ν∞^C(P) from its respective center.
Formally, the k-center problem is to find a set C of k points such that ν∞^C(P) is minimized; namely,
ν∞^opt(P, k) = min_{C, |C|=k} ν∞^C(P).
We will denote the set of centers realizing the optimal clustering by Copt. A more explicit (and somewhat more confusing) definition of the k-center clustering problem is to compute the set C of size k realizing min_C max_p min_{c∈C} d(p, c).
It is known that k-center clustering is NP-Hard, and it is in fact hard to approximate within a factor of 1.86 even in two dimensions. Surprisingly, there is a simple and elegant algorithm that achieves a 2-approximation.
In the ith iteration, the algorithm GreedyKCenter computes the distance ri = max_{p∈P} d(p, Ci−1) from the current set of centers Ci−1 to the point of P farthest from it, and the bottleneck point ci that realizes it. Next, we add ci to Ci−1 to form the new set Ci. We repeat this process k times.
To make this algorithm slightly faster, observe that
di[p] = d(p, Ci) = min( d(p, Ci−1), d(p, ci) ) = min( di−1[p], d(p, ci) ).
In particular, if we maintain for each point p a single variable d[p] with its current distance to the closest center in the current center set, then the above formula boils down to
d[p] ← min( d[p], d(p, ci) ).
Namely, the above algorithm can be implemented using O(n) space, where n = |P|. The ith iteration of
choosing the ith center takes O(n) time. Thus, overall this approximation algorithm takes O(nk) time.
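A sketch of this greedy procedure, maintaining the single distance array described above; the black-box metric access d(·, ·) is passed in as a function, and the helper names are introduced here for illustration.

```python
def greedy_k_center(points, k, d):
    """Greedy k-center: start with an arbitrary point, then repeatedly add the
    point farthest from the current center set.  Uses O(n) extra space and
    O(nk) metric evaluations; yields a 2-approximation (Theorem 4.2.1)."""
    centers = [points[0]]                          # arbitrary first center
    dist_to_centers = [d(p, centers[0]) for p in points]
    for _ in range(1, k):
        # the bottleneck point: farthest from the current set of centers
        idx = max(range(len(points)), key=lambda i: dist_to_centers[i])
        c = points[idx]
        centers.append(c)
        # d[p] <- min(d[p], d(p, c_i)), as in the update rule above
        for i, p in enumerate(points):
            dist_to_centers[i] = min(dist_to_centers[i], d(p, c))
    return centers
```

Running it with k = n produces the full greedy permutation discussed below, together with the radii r_i read off from the distance array just before each center is added.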
A ball of radius r around a point p ∈ P is the set of points of P within distance at most r from p; namely, b(p, r) = { q ∈ P | d(p, q) ≤ r }. Thus, the k-center problem can be interpreted as the problem of covering the points of P using k balls, where we wish to minimize the radius of the largest ball used.
Theorem 4.2.1 Given a set of n points P ⊆ X, belonging to a metric space (X, d), the algorithm GreedyKCenter computes a set K of k centers, such that K is a 2-approximation to the optimal k-center clustering of P; namely, ν∞^K(P) ≤ 2ν∞^opt, where ν∞^opt = ν∞^opt(P, k). The algorithm takes O(nk) time.
Proof: The running time follows by the above description, so we concern ourselves only with the ap-
proximation quality.
By definition, we have rk = ν∞^K(P); let ck+1 be the point of P realizing rk, and let C = K ∪ {ck+1}. Observe that, by the definition of ri, we have r1 ≥ r2 ≥ . . . ≥ rk. Furthermore, for i < j ≤ k + 1 we have that
d(ci, cj) ≥ d(cj, Cj−1) = rj ≥ rk.
Namely, the distance between the closest pair of points in C is at least rk. Now, assume for the sake of contradiction that rk > 2ν∞^opt(P, k). Consider the optimal solution that covers P with k balls of radius ν∞^opt. By the triangle inequality, any two points inside such a ball are at distance at most 2ν∞^opt from each other. Thus, none of these balls can cover two points of C ⊆ P, since the minimum distance in C is > 2ν∞^opt. A contradiction, since we would then need k + 1 balls of radius ν∞^opt to cover P.
In the spirit of never trusting a claim that has only a single proof, we provide an alternative proof.°
Alternative Proof: If every cluster of Copt contains exactly one point of K then the claim follows. Indeed, consider a point p ∈ P, and let c be the center it belongs to in Copt. Also, let k be the center of K that is in Π(Copt, c). We have that d(p, c) = d(p, Copt) ≤ ν∞^opt = ν∞^opt(P, k). Similarly, observe that d(k, c) = d(k, Copt) ≤ ν∞^opt. As such, by the triangle inequality, we have that d(p, k) ≤ d(p, c) + d(c, k) ≤ 2ν∞^opt.
By the pigeonhole principle, the only other possibility is that there are two centers k and u of K that are both in Π(Copt, c), for some c ∈ Copt. Assume, without loss of generality, that u was added later to the center set K by the algorithm GreedyKCenter, say in the ith iteration. But then, since GreedyKCenter always chooses the point farthest away from the current set of centers, we have that k ∈ Ci−1 and
ν∞^K(P) ≤ ν∞^{Ci−1}(P) = d(u, Ci−1) ≤ d(u, k) ≤ d(u, c) + d(c, k) ≤ 2ν∞^opt.
Definition 4.2.2 A set S ⊆ P is an r-net for P if the following two properties hold:
(i) Covering property: all the points of P are within distance at most r from the points of S.
(ii) Separation property: for any pair of points p, q ∈ S, we have that d(p, q) ≥ r.
(One can relax the separation property by requiring that the points of S be at distance Ω(r) apart.)
Intuitively, an r-net of a point set P is a compact representation of P at resolution r. Surprisingly, the greedy permutation of P provides us with such a representation for all resolutions.
Theorem 4.2.3 Let P be a set of n points in a finite metric space, and let its greedy permutation be ⟨c1, c2, . . . , cn⟩ with the associated sequence of radii ⟨r1, r2, . . . , rn⟩. For any i, we have that Ci = {c1, . . . , ci} is an ri-net of P.
Proof: Note that, by construction, rk = d(ck, Ck−1), for all k = 1, . . . , n. As such, for j < k ≤ i ≤ n, we have that d(cj, ck) ≥ rk ≥ ri, which implies the required separation property. The covering property follows by definition; see Eq. (4.1).
We will denote the set of centers realizing the optimal clustering by Copt .
There is a simple and elegant constant factor approximation algorithm for k-median clustering using
local search (its analysis however is painful).
A note on notation. Consider the set U of all k-tuples of points of P. Let pi denote the ith point of P, for i = 1, . . . , n, where n = |P|. For C ∈ U, consider the n-dimensional point
4.3.1 Local Search
We are given a set P of n points and a parameter k. In the following, let
A 2n-approximation. Observation 4.3.1 implies that if we compute a set of centers C using Theorem 4.2.1, then we have that
ν1(C)/(2n) ≤ ν∞^C(P)/2 ≤ ν∞^opt(P, k) ≤ ν1^opt ≤ ν1(C)   ⇒   ν1(C) ≤ 2n·ν1^opt. (4.2)
Improving it. Let 0 < τ < 1 be a parameter to be determined shortly. The local search algorithm AlgLocalSearchKMed initially sets the current set of centers Ccurr to be C. Next, at each iteration, it checks whether the current solution Ccurr can be improved by replacing one of its centers by a center from the outside (we will refer to such an operation as a swap). There are at most |P| · |Ccurr| = nk swaps to consider, as we pick a center c ∈ Ccurr to throw away and a new center o ∈ P \ Ccurr to replace it. We consider the new candidate set of centers K ← (Ccurr \ {c}) ∪ {o}. If ν1(K) ≤ (1 − τ)·ν1(Ccurr), then the algorithm sets Ccurr ← K. The algorithm continues iterating in this fashion over all possible swaps.
The algorithm AlgLocalSearchKMed stops when there is no swap that would improve the current solution by a factor of (at least) (1 − τ). The final content of the set Ccurr is the required constant factor approximation. Note that the running time of the algorithm is
O( (nk)² log_{1/(1−τ)}( ν1(C)/ν1^opt ) ) = O( (nk)² log_{1+τ}(2n) ) = O( (nk)² (log n)/ln(1 + τ) ) = O( (nk)² (log n)/τ ),
by Eq. (4.2) and since 1/(1 − τ) ≥ 1 + τ. The final step follows since 1 + τ ≤ exp(τ) ≤ 1 + 2τ, for τ < 1/2
as can be easily verified. Thus, if τ is polynomially small, then the running time would be polynomial.
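A sketch of the swap loop just described; the cost function ν1 and the metric d are evaluated naively here, so this matches the O((nk)² · (log n)/τ) bound above only up to the per-swap cost. The function and helper names are introduced here for illustration and are not from the text.

```python
def local_search_k_median(points, k, d, tau, initial_centers):
    """AlgLocalSearchKMed sketch: try all nk single swaps, accept a swap only
    if it improves the k-median price by a factor of at least (1 - tau)."""
    def cost(centers):
        # nu_1(C) = sum over points of the distance to the nearest center
        return sum(min(d(p, c) for c in centers) for p in points)

    centers = list(initial_centers)            # e.g., the greedy k-center solution
    curr = cost(centers)
    improved = True
    while improved:
        improved = False
        for i, c in enumerate(centers):
            for o in points:
                if o in centers:
                    continue
                cand = centers[:i] + [o] + centers[i + 1:]
                new_cost = cost(cand)
                if new_cost <= (1 - tau) * curr:   # "significant" improvement
                    centers, curr = cand, new_cost
                    improved = True
                    break
            if improved:
                break
    return centers
```

Choosing tau = ε/(2k), as in the analysis below, makes the number of accepted swaps polynomial while keeping the final solution a (5 + ε)-approximation.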
Clearly, if p is served by the same center in C and in C − c + o, then δp = 0. In particular, we have that δp ≤ 0 for all p ∈ P \ Π(C, c). Thus,
∀c ∈ C, o ∈ Copt:   0 ≤ ∆(c, o) ≤ Σ_{p ∈ Π(C,c) ∪ Π(Copt,o)} δp = Σ_{p ∈ Π(Copt,o)} δp + Σ_{p ∈ Π(C,c) \ Π(Copt,o)} δp. (4.4)
What Eq. (4.4) gives us is a large family of inequalities, all of which hold together. Each inequality is represented by a swap c → o. We would like to pick a set of swaps such that these inequalities, when added together, imply that 5ν1(Copt) ≥ ν1(C); namely, that the local search algorithm provides a constant factor approximation to the optimal clustering. This idea seems somewhat mysterious, but to see that there is indeed hope to achieve it, observe that if our set of swaps T has each center of Copt appearing in it exactly once, then when we add up the first term on the right side of Eq. (4.4), we have
Σ_{c→o∈T} Σ_{p∈Π(Copt,o)} δp = Σ_{c→o∈T} Σ_{p∈Π(Copt,o)} ( d(p, C − c + o) − d(p, C) )
  ≤ Σ_{c→o∈T} Σ_{p∈Π(Copt,o)} ( d(p, Copt) − d(p, C) ) = Σ_{p∈P} ( d(p, Copt) − d(p, C) )
  = ν1^opt − ν1(C), (4.5)
since o ranges over the elements of Copt exactly once, and for o ∈ Copt and p ∈ Π(Copt, o) we have that d(p, C − c + o) ≤ d(p, o) = d(p, Copt). Thus, we can bound the first term of the right side of Eq. (4.4) as required. We thus turn our attention to bounding the second term.
Some intuition about this term: it is the total change in contribution of the points p ∈ Π(C, c) \ Π(Copt, o). These are points that lose their current (beloved) center c. Such a point might be reassigned to the new center o, and then its price might not change by much. However, the fact that p ∉ Π(Copt, o) is (intuitively) a sign that p and o might be very far away, and p would have to look far afield for a new center in C − c + o to serve it; namely, this might be too expensive overall. To minimize it, intuitively, we would like to make the overall size of the sets Π(C, c) \ Π(Copt, o) as small as possible (over our set of swaps T). Intuitively, these sets are minimized when Π(C, c) is as similar to Π(Copt, o) as possible.
Thus, we will say that c ∈ C dominates o ∈ Copt if and only if
|Π(C, c) ∩ Π(Copt, o)| > |Π(Copt, o)| / 2.
Clearly, since the clusters of C form a partition of P, a center o ∈ Copt can be dominated by at most one center of C.
In addition, there might be centers in C that dominate more than one center of Copt, and we will refer to such centers as dictators. Nobody wants to deal with dictators (including us; except for arm dealers, art dealers and superpowers, of course), and as such we will not have any dictators involved in swaps of T. The other kind are centers of C that dominate no center of Copt; we will refer to such centers as drifters. Now, assume that there are overall D dictators in C dominating S centers of Copt. Let F be the number of centers of Copt which are not dominated by anybody (“free” centers). Then the total number of drifters is exactly
F + S − D,
as can be easily verified. Note that S ≥ 2D, since every dictator has at least two “slaves”. As such, the number of drifters is at least F + S − D ≥ F + S − (S/2) = F + S/2.
The set of swaps T. The set of swaps T we consider is constructed as follows. If c ∈ C dominates exactly one center o ∈ Copt, then the swap c → o is included in T. In addition, we add swaps between drifters of C and centers of Copt that are not yet swapped in by swaps of T. The end result is that all the centers of Copt are covered by the swaps of T. We do this in such a way that every drifter participates in at most two swaps of T. Note that the resulting set of swaps T might swap out a center of C at most twice, and it swaps in each center of Copt exactly once. To see why this is possible, observe that we have at least F + S/2 drifters that we use to cover the F + S “free” and “slave” centers of Copt. Clearly, this can be done by using each drifter at most twice in swaps of T.®
Proof: For any point p ∈ P, the points π(p) and p lie in the same cluster of the optimal clustering. As such, by the triangle inequality, we have d(p, π(p)) ≤ d(p, Copt) + d(π(p), Copt). As such, since π is a permutation, we have that
Σ_{p∈P} d(p, π(p)) ≤ Σ_{p∈P} ( d(p, Copt) + d(π(p), Copt) ) = 2ν1^opt.
Lemma 4.3.4 For any c → o ∈ T and any o′ ∈ Copt such that o ≠ o′, we have that c does not dominate o′.
Proof: If c is a drifter then the claim trivially holds. Otherwise, since c cannot be a dictator, it must be that it dominates only o, which implies that it does not dominate o′.
Lemma 4.3.5 We have that Σ_{c→o∈T} Σ_{p∈Π(C,c)\Π(Copt,o)} δp ≤ 4ν1^opt.
® Which comes to teach us that even drifters can be useful sometimes.
Proof: For a swap c → o ∈ T, consider a point p ∈ Π(C, c) \ Π(Copt, o). Let o′ be the optimal center whose cluster contains p. By Lemma 4.3.4, we know that c does not dominate o′, and as such, by Lemma 4.3.2, we have π(p) ∉ Π(C, c). As such, the center serving π(p) is not removed by the swap, and we have that
δp = d(p, C − c + o) − d(p, C) ≤ d(p, π(p)) + d(π(p), C) − d(p, C) = νp.
Note that, by the triangle inequality, νp is always non-negative. As such, we have that
γ = Σ_{c→o∈T} Σ_{p∈Π(C,c)\Π(Copt,o)} δp ≤ Σ_{c→o∈T} Σ_{p∈Π(C,c)\Π(Copt,o)} νp ≤ 2 Σ_{p∈P} νp,
since each center c ∈ C gets swapped out at most twice by T, and as such a point might contribute twice to the summation. But then
γ ≤ 2 Σ_{p∈P} νp = 2 Σ_{p∈P} ( d(p, π(p)) + d(π(p), C) − d(p, C) ) = 2 Σ_{p∈P} d(p, π(p)) ≤ 4ν1^opt.
Removing the strict improvement assumption. In the above proof, we assumed that the current local minimum cannot be improved by a swap. Of course, this might not hold for the solution returned by the algorithm, since the algorithm performs a swap only if it makes “significant” progress. In particular, Eq. (4.3) is in fact
∀c ∈ C, o ∈ P \ C:   −τ·ν1(C) ≤ ν1(C − c + o) − ν1(C). (4.6)
To adapt the proof to use these modified inequalities, observe that the proof worked by adding up k inequalities defined by Eq. (4.3) and showing that the right side is bounded by 5ν1(Copt) − ν1(C). Repeating the same argumentation on the modified inequalities yields
−τk·ν1(C) ≤ 5ν1(Copt) − ν1(C).
This implies ν1(C) ≤ 5ν1^opt/(1 − τk). For arbitrary 0 < ε < 1, setting τ = ε/(2k) we have that ν1(C) ≤ 5(1 + ε)ν1^opt, since 1/(1 − τk) ≤ 1 + 2τk = 1 + ε for τ ≤ 1/(2k). We summarize:
Theorem 4.3.7 Let P be a set of n points in a metric space. For 0 < ε < 1, one can compute a (5 + ε)-approximation to the optimal k-median clustering of P. The running time of the algorithm is O(n²k³ ε⁻¹ log n).
4.4 On k-means clustering
In the k-means clustering problem, a set P ⊆ X is provided together with a parameter k. We would like to find a set of k points C ⊆ P, such that the sum of squared distances of all the points of P to their closest point in C is minimized.
Formally, given a set of centers C, the k-means clustering price of clustering P by C is denoted by
ν2^C(P) = Σ_{p∈P} ( d(p, C) )²,
and the k-means problem is to find a set C of k points such that ν2^C(P) is minimized; namely,
ν2^opt(P, k) = min_{C, |C|=k} ν2^C(P).
Theorem 4.4.1 Let P be a set of n points in a metric space. For 0 < ε < 1, one can compute a (25 + ε)-approximation to the optimal k-means clustering of P. The running time of the algorithm is O(n²k³ ε⁻¹ log n).
4.5 Bibliographical notes
k-center clustering. The algorithm GreedyKCenter was described by Gonzalez [Gon85], but it was probably known before, as the notion of r-net is much older. The hardness of approximating k-center clustering was shown by Feder and Greene [FG88].
k-median/means clustering. The local search algorithm is due to Arya et al. [AGK+ 01]. The extension to k-means is due to Kanungo et al. [KMN+ 04]. The extension is not completely trivial, since the triangle inequality no longer holds; however, some approximate version of the triangle inequality does hold. Instead of performing a single swap, one can decide to do p swaps simultaneously. The running time then deteriorates, since there are more possibilities to check, but this improves the approximation constant for k-median (resp., k-means) to (3 + 2/p) (resp., (3 + 2/p)²). Unfortunately, this is (essentially) tight in the worst case. See [AGK+ 01, KMN+ 04] for details.
The k-median and k-means clustering problems are more interesting in the Euclidean setting, where there is considerably more structure, and one can compute a (1 + ε)-approximation in polynomial time for fixed ε. We will return to this topic later.
Since Dominating Set can easily be reduced to k-median and k-means clustering, both clustering problems are NP-Hard to solve exactly.
One can compute a similar permutation to the greedy permutation (for k-center clustering) also for k-
median clustering. See the work by Mettu and Plaxton [MP03].
Handling outliers. The problem of handling outliers is still not well understood. See the work of Charikar et al. [CKMN01] for some relevant results. In particular, for k-center clustering they get a constant factor approximation, and Exercise 4.6.2 is taken from there. For k-median clustering they present a constant factor approximation using linear programming relaxation that also approximates the number of outliers. Recently, Chen [Che07] provided a constant factor approximation algorithm by extending the work of Charikar et al. The problem of finding a simple algorithm with a simple analysis for k-median clustering with outliers is still open, as Chen's work is quite involved.
Open Problem 4.5.1 Get a simple constant factor k-median clustering algorithm that runs in polynomial time and uses exactly m outliers. Alternatively, solve this problem in the case where P is a set of n points in the plane. (The emphasis here is that the analysis of the algorithm should be simple.)
Bi-criteria approximation. All clustering algorithms tend to become considerably easier if one allows a trade-off in the number of clusters. In particular, one can compute a constant factor approximation to the optimal k-median/means clustering using O(k) centers in O(nk) time. The algorithm succeeds with constant probability. See the work by Indyk [Ind99] and Chen [Che06], and references therein.
Facility location. All the problems mentioned here fall into the family of facility location problems, of which there are numerous variants. The facility location problem proper is a variant of k-median clustering where the number of clusters is not specified; instead, one has to pay to open a facility at a given location. Local search also works for this variant.
Local search. As mentioned above, local search also works for k-means clustering [AGK+ 01]. A collection of some basic problems for which local search works is described in the book by Kleinberg and Tardos [KT06]. Local search is a widely used heuristic for attacking NP-Hard problems. The idea is usually to start from a solution and try to locally improve it. Here, one defines a neighborhood of the current solution, and one tries to move to the best solution in this neighborhood. In this sense, local search can be thought of as a hill-climbing/EM (expectation maximization) algorithm. Problems for which local search has been used include Vertex Cover, Traveling Salesperson, and Satisfiability, and probably many more. Provable cases where local search generates a guaranteed solution are less common and include facility location, k-median clustering [AGK+ 01], k-means clustering [KMN+ 04], weighted max cut, the metric labeling problem with the truncated linear metric [GT00], and image segmentation [BVZ01]. See [KT06] for more references and a nice discussion of the connection of local search to the Metropolis algorithm and simulated annealing.
4.6 Exercises
Exercise 4.6.1 (Handling outliers.) [10 Points]
Given a point set P, we would like to perform k-median clustering of it, where we are allowed to ignore m of the points. These m points are outliers which we would like to ignore since they represent irrelevant data. Unfortunately, we do not know the m outliers in advance. It is natural to conjecture that one can perform a local search for the optimal solution: one maintains a set of k centers and a set of m outliers, and at every point in time the algorithm moves one of the centers or one of the outliers if doing so improves the solution.
Show that local search does not work for this problem; namely, the approximation factor is not a constant.
Chapter 5
“I’ve never touched the hard stuff, only smoked grass a few times with the boys to be polite, and that’s all, though
ten is the age when the big guys come around teaching you all sorts to things. But happiness doesn’t mean much to
me, I still think life is better. Happiness is a mean son of a bitch and needs to be put in his place. Him and me aren’t
on the same team, and I’m cutting him dead. I’ve never gone in for politics, because somebody always stand to gain
by it, but happiness is an even crummier racket, and their ought to be laws to put it out of business.”
– – Momo, Emile Ajar
In this chapter we will try to quantify the notion of geometric complexity. It is intuitively clear that a disk is a simpler shape than an ellipse, which is in turn simpler than a smiley face. This becomes even more important when we consider several such shapes and how they interact with each other. As these examples might demonstrate, this notion of complexity is somewhat elusive.
Next, we show that one can capture the structure of a distribution/point set by a small subset. The size
here would depend on the complexity of the shapes/ranges we care about, but it would be independent of
the size of the point set.
5.1 VC Dimension
Definition 5.1.1 A range space S is a pair (X, R), where X is a (finite or infinite) ground set and R is a (finite or infinite) family of subsets of X. The elements of X are points and the elements of R are ranges. For A ⊆ X, let R|A = { r ∩ A | r ∈ R } denote the projection of R on A.
If R|A contains all subsets of A (i.e., if A is finite, we have |R|A| = 2^|A|), then A is shattered by R.
The Vapnik-Chervonenkis dimension (or VC dimension) of S , denoted by dimVC (S ), is the maximum
cardinality of a shattered subset of X. If there are arbitrarily large shattered subsets then dimVC (S ) = ∞.
5.1.1 Examples
Intervals. Consider X to be the real line, and let R be the set of all intervals on the real line. Clearly, for a set of two points on the real line, say A = {1, 2}, one can find 4 intervals that contain all possible subsets of A. However, this is false for a set of three points B = {p, q, r} with p < q < r, since there is no interval that contains the two extreme points p and r without also containing q. Namely, the subset {p, r} is not realizable for intervals, implying that the largest set shattered by the range space (real line, intervals) is of size two. We conclude that the VC dimension of this range space is two.
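The two-point and three-point arguments can also be checked mechanically. The following sketch (not from the text) brute-forces which subsets of a point set on the line are realizable by closed intervals; the function name is introduced here only for illustration.

```python
from itertools import combinations

def interval_shatters(points):
    """Return True if the range space (real line, intervals) shatters `points`,
    i.e., every subset of `points` equals points ∩ [a, b] for some interval."""
    pts = sorted(points)
    realized = set()
    endpoints = pts + [min(pts) - 1]            # the extra endpoint yields the empty set
    for a in endpoints:
        for b in endpoints:
            realized.add(frozenset(p for p in pts if a <= p <= b))
    return all(frozenset(s) in realized
               for r in range(len(pts) + 1)
               for s in combinations(pts, r))

print(interval_shatters([1, 2]))       # True:  two points are shattered
print(interval_shatters([1, 2, 3]))    # False: the subset {1, 3} is not realizable
```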
Disks. Let X = IR², and let R be the set of disks in the plane. Clearly, for three points in the plane, say 1, 2, and 3, one can find 8 disks that realize all possible 2³ different subsets; see the figure on the right.
But can disks shatter a set of four points? Consider such a set P of four points; there are two possible options. Either the convex hull of P has three points on its boundary, and in this case, the subset containing those three vertices but not the middle point
namely, the ranges of S̄ are the complements of the ranges in S. What is the VC dimension of S̄? Clearly, a set B ⊆ X is shattered by S̄ if and only if it is shattered by S. Thus, dimVC(S̄) = dimVC(S).
Lemma 5.1.2 For a range space S = (X, R) and its complement range space S̄, it holds that dimVC(S̄) = dimVC(S).
Theorem 5.1.3 (Radon’s Lemma) Let A = {p1 , . . . , pd+2 } be a set of d + 2 points in IRd . Then, there exists
two disjoint subsets C and D of A, such that CH(C) ∩ CH(D) , ∅.
P P
Proof: We claim that there exists β1 , . . . , βd+2 , non all of them zero, such that i βi pi = 0 and i βi = 0.
Proof: If pd+1 = pd+2 then we are done. Otherwise, without loss of generality, assume that p1 , . . . , pd spans IRd . Then, there are
P P P P
two non-zero combinations of p1 , . . . , pd , such that pd+1 = di=1 αi pi and pd+2 = di=1 γi pi . Let α = di=1 αi − 1 and γ = di=1 γi − 1.
Pd
If α = 0 then i=1 αi pi − pd+1 (which
Pis the origin) is the required
combination. Similarly, we are done if γ = 0. Otherwise, consider
P
the point di=1 (αi /α)pi − pd+1 /α − di=1 (γi /γ)pi − pd+2 /γ . Clearly, this is the required point.
Pk
Assume, for the sake of simplicity of exposition, that the β1 , . . . βk ≥ 0 and βk+1 , . . . , βd+2 < 0. Furthermore, let µ = i=1 βi .
We have that
X k X
d+2
βi pi = − βi pi .
i=0 i=k+1
64
Pk Pd+2
In particular, v = i=0 (βi /µ)pi is a point in the CH({p1 , . . . , pk }) and i=k+1 −(βi /µ)pi ∈ CH({pk+1 , . . . , pd+2 }). We conclude that v
is in the intersection of the two convex hulls, as required.
In particular, this implies that if a set Q of d + 2 points is shattered by S (the range space defined by halfspaces in IR^d), then we can partition Q
into two disjoint sets A and B such that CH(A) ∩ CH(B) ≠ ∅. It should now be clear that any halfspace
h+ containing all the points of A must also contain a point of CH(B). But this implies that a point of
B must be in h+. Namely, the subset A cannot be realized by a halfspace, which implies that Q cannot be
shattered. Thus dimVC(S) < d + 2. It is also easy to verify that the regular simplex with d + 1 vertices is
shattered by S. Thus, dimVC(S) = d + 1.
In the following, let Gd(n) = Σ_{i=0}^d C(n, i) denote the number of subsets of an n-element set of size at most d (here C(n, i) denotes the binomial coefficient). It is easy to verify that
Gd(n) ≤ n^d,    (5.1)
for d > 1 (the cases where d = 0 or d = 1 are not interesting and we will just ignore them with contempt).
Note that for all n, d ≥ 1, we have Gd(n) = Gd(n − 1) + Gd−1(n − 1).¯
Lemma 5.2.1 (Sauer’s Lemma) If (X, R) is a range space of VC-dimension d with |X| = n, then |R| ≤ Gd(n).

Proof: The proof is by induction on d and n. Fix an element x ∈ X, and consider the two range spaces (X \ {x}, R \ x) and (X \ {x}, R_x), where R \ x = { r \ {x} | r ∈ R } and R_x = { r \ {x} | r ∪ {x} ∈ R and r \ {x} ∈ R }.

Observe that |R| = |R_x| + |R \ x|. Indeed, we charge the elements of R to their corresponding element in R \ x.
The only bad case is when there is a range r such that both r ∪ {x} ∈ R and r \ {x} ∈ R, because then these
two distinct ranges get mapped to the same range in R \ x. But such ranges contribute exactly one element
to R_x.
Observe also that (X \ {x}, R_x) has VC-dimension at most d − 1, as the largest set that can be shattered is of size d − 1.
Indeed, if a set B ⊆ X \ {x} is shattered by R_x, then B ∪ {x} is shattered by R.
Thus,
|R| = |R_x| + |R \ x| ≤ Gd−1(n − 1) + Gd(n − 1) = Gd(n),
by induction.
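The following small Python check (an illustration added here, not part of the text) evaluates Gd(n) = Σ_{i≤d} C(n, i), verifies the recurrence Gd(n) = Gd(n − 1) + Gd−1(n − 1), and confirms Sauer's lemma on the interval range space from Section 5.1.1, which has VC-dimension 2 and realizes exactly 1 + n(n + 1)/2 subsets of an n-point set (the empty set plus every contiguous run).

    from math import comb

    def G(d, n):
        """G_d(n) = number of subsets of an n-element set of size at most d."""
        return sum(comb(n, i) for i in range(d + 1))

    # the recurrence G_d(n) = G_d(n-1) + G_{d-1}(n-1)
    for n in range(1, 20):
        for d in range(1, 6):
            assert G(d, n) == G(d, n - 1) + G(d - 1, n - 1)

    # Sauer's lemma for intervals: VC-dimension 2, and the number of realizable
    # subsets of n points (empty set plus every contiguous run) never exceeds G_2(n).
    for n in range(1, 20):
        num_interval_ranges = 1 + n * (n + 1) // 2
        assert num_interval_ranges <= G(2, n)
    print("recurrence and Sauer bound verified for small n")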
Definition 5.2.2 (Shatter function.) Given a range space S = (X, R), its shatter function πS(m) is the
maximum number of sets that might be created by S when restricted to subsets of size m. Formally,
πS(m) = max_{B ⊆ X, |B| = m} |R|B|.
The shattering dimension of S is the smallest d such that πS(m) = O(m^d), for all m.
¯
Here is a cute (and standard) counting argument: Gd (n) is just the number of different subsets of size at most d out of n elements.
Now, we either decide to not include the first element in these subsets (i.e., Gd (n − 1)) or, alternatively, we include the first element
in these subsets, but then there are only d − 1 elements left to pick (i.e., Gd−1 (n − 1)).
By applying Lemma 5.2.1 to a finite subset of X, we get:
Corollary 5.2.3 If S = (X, R) is a range space of VC-dimension d, then for every finite subset B of X, we
have |R|B| ≤ πS(|B|) ≤ Gd(|B|).
Namely, the VC-dimension of a range space always bounds the shattering dimension of this range space.
Proof: The first part follows from the definition of the shatter function, and the second part follows by applying Lemma 5.2.1 to the range space (B, R|B), whose VC-dimension is at most d.
For the last part, let n = |B|, and observe that |R|B| ≤ Gd(n) ≤ n^d, by Eq. (5.1). As such,
by definition, the shattering dimension of S is at most d; namely, the shattering dimension
is bounded by the VC-dimension.
Disks revisited. To see why the shattering dimension is more convenient to work with than the VC-dimension,
consider the range space S = (X, R), where X = IR² and R is the set of disks in the plane. We know that the
VC-dimension of S is 3 (see Section 5.1.1).
Now, consider the shattering dimension of this range space. Let P be a set of n points in the plane, and
observe that given a disk in the plane, we can continuously deform it until it passes through three points of P
on its boundary, without any point of P changing sides (no point previously outside the disk moves into its interior, and vice versa). As such, the number
of subsets of P that one can realize by intersecting it with a disk is bounded by C(n, 3) · 2³ (we pick the three
vertices that determine the disk, and for each of the three vertices we determine whether we consider it to
be inside the disk or not). As such, the shattering dimension of S is 3.
That might not seem like a great simplification over the same bound we got by arguing about the VC-
dimension. However, the above argumentation gives us a very powerful tool – the shattering dimension of a
range space defined by a family of shapes is always bounded by the number of points that determine a shape
in the family. Thus, the shattering dimension of, say, arbitrarily oriented rectangles in the plane is five, since
such a rectangle is uniquely determined by five points.
Given a range space S = (X, R), its dual range space is S⋆ = (R, X⋆), where X⋆ = { Rp | p ∈ X } and Rp = { r ∈ R | p ∈ r } is the set of ranges containing the point p. To understand what the dual space is, consider X to be the plane, and R to
be a set of m disks. Then, in the dual range space S⋆ = (R, X⋆), every point p in the plane has a set associated with it in X⋆, which is the set of disks of
R that contain p. In particular, if we consider the arrangement formed by the m disks of R, then all the points lying inside a single face of this arrangement
correspond to the same set of X⋆ (see Figure 5.1). Namely, the number of ranges in X⋆ is
bounded by the complexity of the arrangement of these disks, which is O(m²).
Figure 5.1: Rp = Rq.
Thus, let the dual shatter function of the range space S be π⋆S(m) = πS⋆(m), where S⋆ is the dual range space to S. The dual shattering dimension of S is
just the shattering dimension of S⋆.
Note that the dual shattering dimension might be smaller than the shattering dimension or the VC-
dimension of the range space. Indeed, in the case of disks in the plane, the dual shattering dimension is
just 2, while the VC-dimension and the shattering dimension of this range space are 3. Note also that in
geometric settings bounding the dual shattering dimension is relatively easy, as all you have to do is to
bound the complexity of the arrangement of m ranges of this space.
The following lemma shows a connection between the VC-dimension of a space and its dual. The proof
is left as an exercise for the interested reader (see Exercise 5.7.1).
Lemma 5.2.4 Consider a range space S = (X, R) with VC-dimension d. The dual range space S⋆ = (R, X⋆)
has VC-dimension bounded by 2^{d+1}.
Lemma 5.2.5 Let S = (X, R) and T = (X, R′) be two range spaces of VC-dimension d and d′, respectively, where d, d′ > 1. Let R̂ = { r ∪ r′ | r ∈ R, r′ ∈ R′ }. Then, for the range space Ŝ = (X, R̂), we have that dimVC(Ŝ) = O((d + d′) log(d + d′)).

Proof: Let B be a set of n points in X that is shattered by Ŝ. There are at most Gd(n) and Gd′(n) different
projections of B by ranges of R and R′, respectively. Every subset C of B realized by r̂ ∈ R̂
is a union of two subsets B ∩ r and B ∩ r′, where r ∈ R and r′ ∈ R′. Thus, the number of different subsets of
B realized by Ŝ is bounded by Gd(n) Gd′(n). Since B is shattered, 2^n ≤ Gd(n) Gd′(n) ≤ n^d n^{d′}, for d, d′ > 1. We conclude n ≤ (d + d′) lg n,
which implies that n ≤ c(d + d′) log(d + d′), for some constant c.
Corollary 5.2.6 Let S = (X, R) and T = (X, R′) be two range spaces of VC-dimension d and d′, respectively,
where d, d′ > 1. Let R̂ = { r ∩ r′ | r ∈ R, r′ ∈ R′ }. Then, for the range space Ŝ = (X, R̂), we have that
dimVC(Ŝ) = O((d + d′) log(d + d′)).
Proof: Observe that the complement of r ∩ r′ is r̄ ∪ r̄′, and thus the claim follows by Lemma 5.1.2 and Lemma 5.2.5.
In fact, we can summarize the above observations, as follows.
Corollary 5.2.7 Any finite sequence of combining range spaces with finite VC-dimension results in a range
space with a finite VC-dimension.
Definition 5.3.1 (ε-sample) Let S = (X, R) be a range space, and let B be a finite subset of X. A subset C ⊆ B is an ε-sample for B if for any range r ∈ R, we have
| |C ∩ r| / |C| − |B ∩ r| / |B| | ≤ ε.
Namely, an ε-sample is a subset of the ground set that “captures” the range space up to an error of ε;
to estimate (approximately) the fraction of the ground set covered by a range r, it is sufficient to
count the points of C that fall inside r.
If B = X and B is a finite set, we will abuse notation slightly and refer to C as an ε-sample for S.
The author is quite aware that the interest of the reader in this issue might not be the result of free choice. Nevertheless, one
might draw some comfort from the realization that the existence of the interested reader is as much of an illusion as the existence
of free choice. Both are convenient to assume, and both are probably false. Or maybe not.
To see the usage of such a sample, consider X to be, say, the population of a country (i.e., an
element of X is a citizen). A range in R is the set of all people in the country that answer yes to a question
(i.e., would you vote for party Y? would you buy a bridge from me? stuff like that). An ε-sample of this
range space enables us to estimate reliably (up to an error of ε) the answer for all these questions, by just
asking the people in the sample.
The natural question of course is to find such a subset of small (or minimal) size.
Theorem 5.3.2 (ε-sample theorem, [VC71]) There is a positive constant c such that if (X, R) is any range
space with VC-dimension at most d, B ⊆ X is a finite subset and ε, δ > 0, then a random subset C of B of
cardinality s, where s is at least the minimum of |B| and
(c/ε²) ( d log(d/ε) + log(1/δ) ),
is an ε-sample for B with probability at least 1 − δ.
Sometimes it is sufficient to have (hopefully smaller) samples with a weaker property – if a range is
“heavy” then there is an element in our sample that is in this range.
Definition 5.3.3 (ε-net) A set N ⊆ B is an ε-net for B if for any range r ∈ R with |r ∩ B| ≥ ε|B|, the range r
contains at least one point of N (i.e., r ∩ N ≠ ∅).
Theorem 5.3.4 (ε-net theorem, [HW87]) Let (X, R) be a range space of VC-dimension d, let B be a finite
subset of X and suppose 0 < ε, δ < 1. Let N be a set obtained by m random independent draws from B,
where
m ≥ max( (4/ε) log(2/δ), (8d/ε) log(8d/ε) ).    (5.2)
Then N is an ε-net for B with probability at least 1 − δ.
The above two theorems also hold for spaces with shattering dimension at most d. The constant in the
sample size deteriorates a bit.
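As a rough illustration (added here, not from the text), the snippet below just evaluates the two sample-size bounds numerically: the ε-sample bound of Theorem 5.3.2 (whose constant c is unspecified, so c = 1 is used only as a placeholder) and the ε-net bound m of Eq. (5.2), reading log as lg, as in the proof. For small ε the ε-net bound is roughly a factor of 1/ε smaller.

    from math import log, log2

    def eps_sample_size(d, eps, delta, c=1.0):
        """The bound of Theorem 5.3.2; c is unspecified there, so c=1 is a placeholder."""
        return (c / eps**2) * (d * log(d / eps) + log(1 / delta))

    def eps_net_size(d, eps, delta):
        """The bound m of Eq. (5.2), reading log as lg (base 2)."""
        return max((4 / eps) * log2(2 / delta), (8 * d / eps) * log2(8 * d / eps))

    d, delta = 3, 0.01
    for eps in (0.1, 0.01):
        print(eps, round(eps_sample_size(d, eps, delta)), round(eps_net_size(d, eps, delta)))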
As an application, consider the following learning scenario: there is an unknown distribution D over the plane, and points are labeled by an unknown function f that assigns ‘1’ to the points inside some (unknown) disk and ‘0’ to the points outside it. Theorem 5.3.2 tells us that if we pick (roughly) O((1/ε) log(1/ε)) random points in a sample R from
this distribution, compute the labels for the samples, and find the smallest disk D that contains the sampled
points labeled by ‘1’ and does not contain any of the ‘0’ points, then the function g that returns ‘1’ inside the disk
and ‘0’ otherwise correctly classifies all but an ε-fraction of the points (i.e., the probability of misclassifying a
point picked according to the given distribution is smaller than ε).
To see that, consider the range space S having the plane as the ground set, and the symmetric differences
between pairs of disks as the ranges. By Corollary 5.2.7, this range space has finite VC-dimension. Now,
consider the (unknown) disk D′ that induces f and the region r = D ⊕ D′. Clearly, the learned classifier g
returns an incorrect answer only for points picked inside r.
So the probability for a mistake in the classification is the measure of r under the distribution D. If
Pr_D[r] > ε, then by the ε-net theorem (i.e., Theorem 5.3.4) the set R is an ε-net for S (ignore for the time
being the possibility that the random sample fails to be an ε-net), and as such R contains a point q inside r.
But then it is not possible that g (which classifies correctly all the sampled points of R) makes a mistake on
q. A contradiction, because by construction the range r is exactly where g misclassifies points. We conclude that
Pr_D[r] ≤ ε, as desired.
Tell me lies, tell me sweet little lies. The careful reader might be tearing her hair out because of the above
description. First, Theorem 5.3.4 might fail, and the above conclusion might not hold. This is of course
true, and in real applications one might use a much larger sample to guarantee that the probability of failure
is so small that it can be practically ignored. A more serious issue is that Theorem 5.3.4 is defined only
for finite sets. Nowhere does it speak about a continuous distribution. Intuitively, one can approximate a
continuous distribution to an arbitrary precision using a huge sample, and apply the theorem to this sample
as our ground set. A formal proof is more tedious and requires extending the proof of Theorem 5.3.4 to
continuous distributions. This is straightforward and we will ignore this topic altogether.
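To see the learning argument in action without implementing smallest enclosing disks, here is a simplified one-dimensional analogue (my own sketch, under the assumption that the target concept is an interval on [0, 1] and the distribution is uniform; the sample-size constant is ad hoc): the learned classifier g is the smallest interval containing the positively labeled samples, and its error is estimated empirically.

    import math, random

    def learn_interval(eps, n_test=100_000, seed=0):
        """1D analogue of the disk-learning example: learn an unknown interval."""
        rng = random.Random(seed)
        lo, hi = 0.3, 0.7                          # the unknown target interval

        def f(x):                                  # the unknown labeling function
            return lo <= x <= hi

        # sample size of the flavor suggested by the eps-net theorem (constants ad hoc)
        m = int((8 / eps) * max(1.0, math.log2(8 / eps)))
        sample = [rng.random() for _ in range(m)]  # distribution = uniform on [0, 1]
        pos = [x for x in sample if f(x)]

        g_lo, g_hi = (min(pos), max(pos)) if pos else (1.0, 0.0)

        def g(x):                                  # the learned classifier
            return g_lo <= x <= g_hi

        errors = 0
        for _ in range(n_test):
            x = rng.random()
            errors += (f(x) != g(x))
        return errors / n_test

    print(learn_interval(eps=0.05))                # typically well below 0.05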
A Naive Proof of the ε-Sample Theorem. To demonstrate why the ε-sample/net theorems are interesting, let us try to prove
the ε-sample theorem in the natural naive way. Thus, consider a finite range space S = (X, R) with shattering dimension d, and
consider a range r that contains, say, a p fraction of the points of X, where p ≥ ε. Consider a random sample R of r points from X,
picked with replacement.
Let pi be the ith sample point, and let Xi be an indicator variable which is one if and only if pi ∈ r. Clearly, (Σ_i Xi)/r is an
estimate for p = |r ∩ X| / |X|. We would like this estimate to be within ±ε of p, and with confidence ≥ 1 − δ.
As such, the sample failed if |Σ_{i=1}^r Xi − pr| ≥ ∆ = εr = (ε/p)pr. Set φ = ε/p and µ = E[Σ_i Xi] = pr. Using the Chernoff
inequality (Theorem 25.2.6 and Theorem 25.2.9) we have
Pr[ |Σ_{i=1}^r Xi − pr| ≥ (ε/p)pr ] = Pr[ |Σ_{i=1}^r Xi − µ| ≥ φµ ] ≤ exp(−µφ²/2) + exp(−µφ²/4) ≤ 2 exp(−µφ²/4) = 2 exp( −ε²r/(4p) ) ≤ δ,
for r ≥ ⌈ (4/ε²) ln(2/δ) ⌉ ≥ ⌈ (4p/ε²) ln(2/δ) ⌉.
Voila! We have proved the ε-sample theorem. Well, not quite. We proved that the sample works correctly for a single range.
The problem is that the number of ranges for which we need to prove the theorem is πS(|X|) (see Definition 5.2.2). In particular, if we
plug the confidence δ/πS(|X|) into the above analysis and use the union bound, we get that for
r ≥ ⌈ (4/ε²) ln( πS(|X|)/δ ) ⌉
the sample estimates correctly (up to ±ε) the size of all ranges, with confidence ≥ 1 − δ. Bounding
πS(|X|) by O(|X|^d) (using Eq. (5.1) for a space with VC-dimension d), we can bound the required size of r by O( d ε^{−2} log(|X|/δ) ).
Namely, the “naive” argumentation gives us a sample bound which depends on the underlying size of the ground set. However,
the sample size in the ε-sample theorem (Theorem 5.3.2) is independent of the size of the ground set. This is the magical property
of the ε-sample theorem.®
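The contrast between the two bounds is easy to see numerically. The snippet below (an added illustration, with the unspecified constants set to 4 and 1 respectively) evaluates the union-bound sample size, which grows with n = |X|, against the ε-sample bound of Theorem 5.3.2, which does not.

    from math import log

    def naive_union_bound(d, eps, delta, n):
        """Sample size from the union-bound argument above: grows with n = |X|."""
        return (4 / eps**2) * log(n**d / delta)

    def eps_sample_bound(d, eps, delta, c=1.0):
        """Theorem 5.3.2 bound (unspecified constant set to 1): independent of n."""
        return (c / eps**2) * (d * log(d / eps) + log(1 / delta))

    d, eps, delta = 2, 0.1, 0.01
    for n in (10**3, 10**6, 10**9):
        print(n, round(naive_union_bound(d, eps, delta, n)), round(eps_sample_bound(d, eps, delta)))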
5.3.3 A quicky proof of Theorem 5.3.4
Here we provide a sketchy proof of Theorem 5.3.4, which conveys the main ideas. The full proof in all its glory and details is
provided in Section 5.5.
Let N = (x1, . . . , xm) be the sample obtained by m independent draws from A, where A ⊆ X is the given ground set with |A| = n. Let E1 be the event that N fails to be an
ε-net. Namely,
E1 ≡ [ ∃r ∈ R such that |r ∩ A| ≥ εn and r ∩ N = ∅ ].
To complete the proof, we must show that Pr[E1] ≤ δ. Let T = (y1, . . . , ym) be another random sample generated in a similar
fashion to N. Let E2 be the event that N fails but T “works”; formally,
E2 ≡ [ ∃r ∈ R such that |r ∩ A| ≥ εn, r ∩ N = ∅, and |r ∩ T| ≥ εm/2 ].
Intuitively, since E_T[|r ∩ T|] ≥ εm, for the range r that N fails on we have with “good” probability that |r ∩ T| ≥ εm/2.
Namely, Pr[E1] ≈ Pr[E2].
Next, let
E2′ ≡ [ ∃r ∈ R such that r ∩ N = ∅ and |r ∩ T| ≥ εm/2 ].
Clearly, E2 ⊆ E2′, and as such Pr[E2] ≤ Pr[E2′]. Now, fix Z = N ∪ T, and observe that |Z| = 2m. Next, fix a range r, and observe
that the bad probability of E2′ is maximized if |r ∩ Z| = εm/2. Now, the probability that all the elements of r ∩ Z fall only into the
second half of the sample (i.e., into T) is at most 2^{−εm/2}, as a careful calculation shows. Now, there are at most |R|Z| ≤ Gd(2m) different ranges that
one has to consider. As such, Pr[E1] ≈ Pr[E2] ≤ Pr[E2′] ≤ Gd(2m) 2^{−εm/2}, and this is smaller than δ, as a careful calculation shows,
by just plugging the value of m into the right-hand side.
5.4 Discrepancy
The proof of the ε-sample/net Theorem is somewhat complicated. It turns out that one can get a somewhat similar result by
attacking the problem from the other direction; namely, let us assume that we would like to take a truly large sample of a finite
range space S = (X, R) defined over n elements with m ranges. We would like this sample to be as representative as possible as far
as S is concerned. In fact, let us decide that we would like to pick exactly half of the points of X to our sample (assume that n = |X|
is even).
To this end, let us color half of the points of X by −1 (i.e., black) and the other half by 1 (i.e., white). If for every range, r ∈ R,
the number of black points inside it is equal to the number of white points, then doubling the number of black points inside a range,
gives us the exact number of points inside the range. Of course, such a perfect coloring is unachievable in almost all situations. To
see this, consider the clique graph K4 – clearly, in any coloring of the vertices, there must be an edge with two endpoints having the
same color.
Formally, let χ : X → {−1, 1} be the given coloring. The discrepancy of χ over a range r is the amount of imbalance in the
coloring inside r. Namely,
|χ(r)| = | Σ_{p ∈ r} χ(p) |.
The overall discrepancy of χ is disc(χ) = max_{r ∈ R} |χ(r)|. The discrepancy of a (finite) range space S = (X, R) is the discrepancy of
the best possible coloring; namely,
disc(S) = min_{χ : X → {−1,+1}} disc(χ).
The natural question is, of course, how to compute the coloring χ of minimum discrepancy. This seems like a very challenging
question, but when you do not know what to do, you might as well do something random. So, let us pick a random coloring χ of
X. To this end, let P be an arbitrary partition of X into pairs (i.e., a perfect matching). For a pair {p, q} ∈ P, we will either color
χ(p) = −1 and χ(q) = 1, or the other way around; namely, χ(p) = 1 and χ(q) = −1. We will decide how to color this pair using
a single coin flip. Thus, our coloring would be induced by making such a decision for every pair of P, and let χ be the resulting
coloring. We will refer to χ as compatible with the partition P if χ({p, q}) = 0, for all {p, q} ∈ P.
Consider a range r. If a pair {p, q} ∈ P falls completely inside r or completely outside r, then it does not
contribute anything to the discrepancy of r. Thus, the only pairs that contribute to the discrepancy of r are the crossing
pairs; namely, pairs with {p, q} ∩ r ≠ ∅ and {p, q} ∩ (X \ r) ≠ ∅. In particular, let #r denote the number of
crossing pairs for r, and let Xi ∈ {−1, +1} be the indicator variable which is the contribution of the ith crossing
pair to the discrepancy of r. For ∆r = sqrt( 2 #r ln(4m) ), we have by the Chernoff inequality (Theorem 25.2.1) that
Pr[ |χ(r)| ≥ ∆r ] ≤ 2 Pr[ Σ_i Xi ≥ ∆r ] ≤ 2 exp( −∆r² / (2 #r) ) = 1/(2m).
Since there are m ranges in R, it follows that with good probability (i.e., at least half) for all r ∈ R the discrepancy
of r is at most ∆r.
Theorem 5.4.1 Let S = (X, R) be a range space defined over n = |X| elements with m = |R| ranges. Consider any partition P of the
elements of X into pairs. Then, with probability ≥ 1/2, a random coloring χ : X → {−1, +1} that is compatible
with the partition P has, for every range r ∈ R, discrepancy at most
|χ(r)| ≤ ∆r = sqrt( 2 #r ln(4m) ),
where #r denotes the number of pairs of P that cross r. In particular, since #r ≤ |r|, we have |χ(r)| ≤ sqrt( 2 |r| ln(4m) ).
Observe that for every range r it holds that #r ≤ n/2, since 2 #r ≤ |X|. As such, we have:
Corollary 5.4.2 Let S = (X, R) be a range space defined over n elements with m ranges. Let P be an arbitrary partition of X into
pairs. Then a random coloring which is compatible with P has disc(χ) ≤ sqrt( n ln(4m) ), with probability ≥ 1/2.
One can easily amplify the probability of success of the coloring by increasing the threshold. In particular, for any constant
c ≥ 1, one has that
|χ(r)| ≤ sqrt( 2c #r ln(4m) )   for all r ∈ R,
with probability ≥ 1 − 2/(4m)^c.
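The following Python experiment (added for illustration, not from the text) applies the random compatible coloring to the concrete range space of intervals over n points on a line. Pairing consecutive points makes every interval cross at most two pairs, so by Theorem 5.4.1 the discrepancy is tiny; a random pairing, for which #r can be as large as n/2, typically gets much closer to the generic bound of Corollary 5.4.2.

    import math, random

    def compatible_coloring(pairs, n, rng):
        """Color each pair {p, q} as (+1, -1) or (-1, +1) by an independent coin flip."""
        chi = [0] * n
        for p, q in pairs:
            s = rng.choice((1, -1))
            chi[p], chi[q] = s, -s
        return chi

    def interval_discrepancy(chi):
        """max over all interval ranges [i, j] of |sum of chi over the range|."""
        best = 0
        for i in range(len(chi)):
            run = 0
            for j in range(i, len(chi)):
                run += chi[j]
                best = max(best, abs(run))
        return best

    rng = random.Random(1)
    n = 1000                                   # ground set: points 0, 1, ..., n-1 on a line
    m = n * (n + 1) // 2                       # number of distinct interval ranges

    consecutive = [(i, i + 1) for i in range(0, n, 2)]           # every interval crosses <= 2 pairs
    perm = list(range(n)); rng.shuffle(perm)
    shuffled = [(perm[i], perm[i + 1]) for i in range(0, n, 2)]  # crossing number can be ~ n/2

    for pairs in (consecutive, shuffled):
        chi = compatible_coloring(pairs, n, rng)
        print(interval_discrepancy(chi), "vs bound", round(math.sqrt(n * math.log(4 * m))))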
Lemma 5.4.3 Let Q ⊆ P be a δ-sample for P (in some underlying range space S), and let R ⊆ Q be a ρ-sample for Q. Then R is
a (δ + ρ)-sample for P.
By Lemma 5.4.3, we have that Pk is a (Σ_{i=1}^k δi)-sample for P. Since we would like the smallest set in the sequence P1, P2, . . . that
is still an ε-sample, we would like to find the maximal k such that Σ_{i=1}^k δi ≤ ε. We thus require that
Σ_{i=1}^k δi = Σ_{i=1}^k τ(n_{i−1}) = c Σ_{i=1}^k sqrt( d ln(n/2^{i−1}) / (n/2^{i−1}) ) ≤ c1 sqrt( d ln(n/2^{k−1}) / (n/2^{k−1}) ) = c1 sqrt( d ln n_{k−1} / n_{k−1} ) ≤ ε,
where c1 is a sufficiently large constant. This holds for n_{k−1} ≥ (4 c1² d / ε²) ln(c1 d / ε), as can be verified by an easy calculation. In
particular, taking the largest k for which this holds results in a set Pk of size O( (d/ε²) ln(d/ε) ), which is an ε-sample for P.
Theorem 5.4.4 (ε-sample via discrepancy.) There is a positive constant c such that if (X, R) is any range space with shattering
dimension at most d, B ⊆ X is a finite subset and ε, δ > 0, then there exists a subset C ⊆ B, of cardinality O( (d/ε²) ln(d/ε) ), such
that C is an ε-sample for B.
for some constant c, since the crossing number of a range r ∩ P_{i−1} is always bounded by its size. This is equivalent to
2^{i−1} ν_{i−1} − 2^i νi ≤ c 2^{i−1} sqrt( d ν_{i−1} ln n_{i−1} ).    (5.4)
We need the following technical claim, which states that νk behaves as we expect: as long as the set Pk is large enough,
νk is roughly ν0/2^k.
Claim 5.4.5 There is a constant c4 (independent of d), such that for all k with ν0/2^k ≥ c4 d ln nk, it holds that (ν0/2^k)/2 ≤ νk ≤ 2(ν0/2^k).
Proof: The proof is by induction. For k = 0 the claim trivially holds. Assume that it holds for all i < k. Adding up the
inequalities of Eq. (5.4), for i = 1, . . . , k, we have that
ν0 − 2^k νk ≤ Σ_{i=1}^k c 2^{i−1} sqrt( d ν_{i−1} ln n_{i−1} ) ≤ Σ_{i=1}^k c 2^{i−1} sqrt( 2d (ν0/2^{i−1}) ln n_{i−1} ) ≤ c3 2^k sqrt( d (ν0/2^k) ln nk ),
Theorem 5.4.6 (ε-net via discrepancy.) There is a positive constant c such that if (X, R) is any range space with shattering
dimension at most d, B ⊆ X is a finite subset and ε, δ > 0, then there exists a subset C ⊆ B, of cardinality O((d/ε) ln(d/ε)), such that
C is an ε-net for B.
5.5 Proof of the ε-net Theorem
In this section, we finally prove Theorem 5.3.4.
Let (X, R) be a range space of VC-dimension d, and let A be a subset of X of cardinality n. Suppose that m satisfies Eq. (5.2).
Let N = (x1, . . . , xm) be the sample obtained by m independent samples from A (the elements of N are not necessarily distinct, and
this is why we treat N as an ordered set). Let E1 be the event that N fails to be an ε-net. Namely,
E1 ≡ [ ∃r ∈ R such that |r ∩ A| ≥ εn and r ∩ N = ∅ ].
(Namely, there exists a “heavy” range r that does not contain any point of N.) To complete the proof, we must show that Pr[E1] ≤ δ.
Let T = (y1, . . . , ym) be another random sample generated in a similar fashion to N. Let E2 be the event that N fails but T “works”;
formally,
E2 ≡ [ ∃r ∈ R such that |r ∩ A| ≥ εn, r ∩ N = ∅, and |r ∩ T| ≥ εm/2 ].
Intuitively, since E_T[|r ∩ T|] ≥ εm, for the range r that N fails on we have with “good” probability that |r ∩ T| ≥ εm/2.
Namely, E1 and E2 have more or less the same probability.

Lemma 5.5.1 Pr[E2] ≤ Pr[E1] ≤ 2 Pr[E2].
Proof: Clearly, E2 ⊆ E1, and thus Pr[E2] ≤ Pr[E1]. As for the other part, note that by the definition of conditional probability,
we have
Pr[E2 | E1] = Pr[E2 ∩ E1] / Pr[E1] = Pr[E2] / Pr[E1].
It is thus enough to show that Pr[E2 | E1] ≥ 1/2.
Assume that E1 occurs. Then there is r ∈ R such that |r ∩ A| > εn and r ∩ N = ∅. The required probability is at least the
probability that for this specific r we have |r ∩ T| ≥ εm/2. However, |r ∩ T| is a binomial variable with expectation εm and variance
ε(1 − ε)m ≤ εm. Thus, by the Chebychev inequality (Theorem 25.1.2), it holds that
Pr[ |r ∩ T| < εm/2 ] ≤ Pr[ | |r ∩ T| − εm | > εm/2 ] = Pr[ | |r ∩ T| − εm | > (sqrt(εm)/2) · sqrt(εm) ] ≤ ( 2/sqrt(εm) )² ≤ 1/2,
since εm ≥ 8 by our choice of m (Eq. (5.2)). Thus
Pr[E2] / Pr[E1] ≥ Pr[ |r ∩ T| ≥ εm/2 ] = 1 − Pr[ |r ∩ T| < εm/2 ] ≥ 1/2.
Next, define
E2′ ≡ [ ∃r ∈ R such that r ∩ N = ∅ and |r ∩ T| ≥ εm/2 ].
Clearly, E2 ⊆ E2′, and thus Pr[E2] ≤ Pr[E2′].

Lemma 5.5.2 Pr[E2] ≤ Pr[E2′] ≤ Gd(2m) 2^{−εm/2}.

Proof: We imagine that we sample the elements of N ∪ T together, by picking Z = (z1, . . . , z2m) independently from A. Next,
we randomly decide which m elements of Z go into N, and the remaining elements go into T. Clearly,
Pr[E2′] = Σ_Z Pr[E2′ | Z] Pr[Z].
Thus, from this point on, we fix the set Z, and we bound Pr[E2′ | Z]. (Note that Pr[E2′] can be interpreted as an averaging of Pr[E2′ | Z], and thus
a bound on this quantity would imply the same bound on Pr[E2′].)
It is now enough to consider the ranges in the projection space (Z, R|Z). By Lemma 5.2.1, we have |R|Z| ≤ Gd(2m).
Let us fix any r ∈ R|Z, and consider the event
Er ≡ [ r ∩ N = ∅ and |r ∩ T| > εm/2 ].
For k = |r ∩ (N ∪ T)| ≥ εm/2, we have
Pr[Er] ≤ Pr[ r ∩ N = ∅ | |r ∩ (N ∪ T)| = k ] = C(2m − k, m) / C(2m, m)
= [ (2m − k)(2m − k − 1) · · · (m − k + 1) ] / [ 2m(2m − 1) · · · (m + 1) ]
= [ m(m − 1) · · · (m − k + 1) ] / [ 2m(2m − 1) · · · (2m − k + 1) ] ≤ 2^{−k} ≤ 2^{−εm/2}.
Thus,
Pr[E2′ | Z] ≤ Σ_{r ∈ R|Z} Pr[Er] ≤ |R|Z| 2^{−εm/2} ≤ Gd(2m) 2^{−εm/2},
implying that Pr[E2′] ≤ Gd(2m) 2^{−εm/2}.
Proof of Theorem 5.3.4. By Lemma 5.5.1 and Lemma 5.5.2, we have Pr[E1] ≤ 2 Gd(2m) 2^{−εm/2}. It thus remains to verify that
if m satisfies Eq. (5.2), then 2 Gd(2m) 2^{−εm/2} ≤ δ.
Indeed, we know that 2m ≥ 8d, and as such Gd(2m) = Σ_{i=0}^d C(2m, i) ≤ Σ_{i=0}^d (2m)^i / i! ≤ (2m)^d, for d > 1. Thus, it is sufficient to show
that the inequality 2(2m)^d 2^{−εm/2} ≤ δ holds. By taking lg of both sides and rearranging, we have that this is equivalent to
εm/2 ≥ d lg(2m) + lg(2/δ).
By our choice of m (see Eq. (5.2)), we have that εm/4 ≥ lg(2/δ). Thus, we need to show that
εm/4 ≥ d lg(2m).
We verify this inequality for m = (8d/ε) lg(8d/ε). Indeed,
2d lg(8d/ε) ≥ d lg( (16d/ε) lg(8d/ε) ).
This is equivalent to (8d/ε)² ≥ (16d/ε) lg(8d/ε), which is equivalent to 4d/ε ≥ lg(8d/ε), which is certainly true for 0 < ε ≤ 1 and d > 1.
This completes the proof of the theorem.
5.6.1 Variants and extensions
A natural application of the ε-sample theorem is to use it to estimate the weights of ranges. In particular, given a finite range space
(X, R), we would like to build a data-structure such that we can decide quickly, given a query range r, what is the number of points
of X inside r. We could always use a sample of size (roughly) O(ε^{−2}) to get an estimate of the weight of a range, using the ε-sample
theorem. The error of the estimate is εn, where n = |X|; namely, the error is additive. The natural question is whether one can get
a relative estimate ρ, such that pn ≤ ρ ≤ (1 + ε)pn, where |r ∩ X| = pn.
In particular, a subset A ⊆ X is a (relative) (ε, p)-sample if for each r ∈ R of weight exceeding pn, it holds that
| |r ∩ A| / |A| − |r ∩ X| / |X| | ≤ ε |r ∩ X| / |X|.
Of course, one can simply generate an εp-sample of size (roughly) O(1/(εp)²) by the ε-sample theorem. This is not very interesting
when p = 1/√n. Interestingly, the dependency on p can be improved.
Theorem 5.6.1 ([LLS01]) Let (X, R) be a range space with shattering dimension d, where |X| = n, and let 0 < ε < 1 and 0 < p < 1
be given parameters. Consider a random sample A ⊆ X of size
( c / (ε² p) ) ( d log(1/p) + log(1/δ) ),
where c is a constant. Then, it holds that for each range r ∈ R of at least pn points, we have
| |r ∩ A| / |A| − |r ∩ X| / |X| | ≤ ε |r ∩ X| / |X|.
In other words, A is an (ε, p)-sample for (X, R). The probability of success is ≥ 1 − δ.
5.7 Exercises
Exercise 5.7.1 (On the VC-dimension of the dual range space.) [5 Points]
Prove Lemma 5.2.4. Namely, given a range space S = (X, R) of VC-dimension d, prove that the dual range space S⋆ = (R, X⋆)
has VC-dimension bounded by 2^{d+1}.
[Hint: Represent a finite range space as a matrix, where the elements are the columns and the sets are the rows. Interpret what it means for a set of size d to be shattered in this setting. What is the dual range space in this setting?]
(b) [5 Points] Let Ψ′ be the set of all permutations of 1, . . . , 2m. Prove that for a random σ ∈ Ψ′, we have
Pr[ | (Σ_{i=1}^m b_{σ(i)})/m − (Σ_{i=1}^m b_{σ(i+m)})/m | ≥ ε ] ≤ 2 e^{−C ε² m / 2},
where C is an appropriate constant. [Hint: Use (a), but be careful.]
(c) [10 Points] Prove Theorem 5.3.2 using (b).
Chapter 6

Sampling and the Moments Technique
6.1 Vertical Decomposition

Given a set S of n segments in the plane, and a subset R ⊆ S, let A|(R) denote the vertical decomposition of the plane formed by the arrangement A(R) of
the segments of R. This is the partition of the plane into interior-disjoint vertical
trapezoids formed by erecting vertical walls through each vertex of A|(R). Formally, a vertex of A|(R) is either an endpoint of a segment of R or an intersection
point of two of its segments. From each such vertex we shoot up (and, similarly, down)
a vertical ray till it hits a segment of R, or the ray continues all the way to infinity. See the figure on the right.
Note that a vertical trapezoid is defined by at most 4 segments: two segments defining its ceiling and floor, and two segments defining the two intersection points that induce the two vertical walls on its boundary. Of course, a
vertical trapezoid might be degenerate, and thus defined by fewer segments (e.g., an
unbounded vertical trapezoid, or a triangle).
Vertical decomposition breaks the faces of the arrangement, which might be
arbitrarily complicated, into entities (i.e., vertical trapezoids) of constant complexity. This makes handling arrangements much easier computationally.
In the following, we assume that the segments of S have k intersection points overall, and we want to compute the arrangement
A = A(S); namely, compute the edges, vertices and faces of A(S). One possible way of doing it, is the following: Compute a
random permutation of the segments of S: S = hs1 , . . . , sn i. Let Si = hs1 , . . . , si i be the prefix of length i of S. Compute A| (Si )
from A| (Si−1 ), for i = 1, . . . , n. Clearly, A| (S) = A| (Sn ), and we can extract A(S) from it.
Randomized Incremental Construction (RIC). Imagine that we had computed the arrangement Bi−1 = A| (Si−1 ). In
the ith iteration we compute Bi by inserting si into the arrangement Bi−1 . This involves splitting some trapezoids (and merging
some others),
As a concrete example, consider the figure on the right. Here we insert s into
the arrangement. To this end we split the “vertical trapezoids” △pqs and △rqs, each
into three trapezoids. The two trapezoids σ′ and σ′′ need to be now merged together
to form the new trapezoid which appears in the vertical decomposition of the new
arrangement. (Note that the figure does not show all the trapezoids in the vertical
decomposition.)
To facilitate this, we need to compute the trapezoids of Bi−1 that intersect si.
This is done by maintaining a conflict graph. Each trapezoid σ ∈ A|(Si−1) maintains
a conflict list cl(σ) of the segments of S that intersect its interior. We also maintain
a similar structure for each segment, listing all the trapezoids of A|(Si−1) that it currently intersects (in its interior). We maintain
those lists with cross-pointers, so that given an entry (σ, s) in the conflict list of σ, we can find the entry (s, σ) in the conflict list of
s in constant time.
Thus, given si, we know which trapezoids need to be split (i.e., all the trapezoids in cl(si)).
Splitting a trapezoid σ by a segment si is the operation of computing a set of (at most) 4 trapezoids that
cover σ and have si on their boundary. We compute those new trapezoids, and next we need to compute the
conflict lists of the new trapezoids. This can be easily done by taking the conflict list of a trapezoid σ ∈ cl(si)
and distributing its segments among the O(1) new trapezoids that cover σ. Using a careful implementation,
this requires time linear in the size of the conflict list of σ.
In the above description, we ignored the need to merge adjacent trapezoids if they have identical floor
and ceiling - this can be done by a somewhat straightforward and tedious implementation of the vertical-decomposition data-structure, by providing pointers between adjacent vertical trapezoids, and maintaining the conflict lists sorted
(or by using hashing) so that merge operations can be done quickly. This is somewhat tedious, but it can be done in time linear in
the input/output size involved, as can be verified.
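The following minimal sketch (my own, not the segment algorithm itself) runs the same randomized incremental scheme in one dimension, where the “trapezoids” are simply the open intervals between consecutive inserted points: it locates the interval containing the new point, splits it, and redistributes its conflict list, mirroring the conflict-list bookkeeping described above (with a linear scan standing in for the cross-pointers).

    import random

    def ric_1d(points, seed=0):
        """1D analogue of the RIC above: regions are open intervals between inserted
        points, each keeping a conflict list of the uninserted points in its interior.
        Returns the total size of all conflict lists ever created (the total work)."""
        rng = random.Random(seed)
        pts = points[:]
        rng.shuffle(pts)                       # random insertion order

        # Initially one interval (-inf, +inf) conflicting with every input point.
        intervals = {(float("-inf"), float("inf")): list(pts)}
        work = len(pts)

        for p in pts:
            # Locate the interval containing p.  (A real implementation keeps a
            # cross-pointer from each uninserted point to its interval; we just scan.)
            containing = next(iv for iv in intervals if iv[0] < p < iv[1])
            lo, hi = containing
            conflict = intervals.pop(containing)
            # Split the interval at p and distribute its conflict list.
            left = [q for q in conflict if q < p]
            right = [q for q in conflict if q > p]
            intervals[(lo, p)] = left
            intervals[(p, hi)] = right
            work += len(left) + len(right)
        return work

    rng = random.Random(42)
    n = 2000
    print(ric_1d([rng.random() for _ in range(n)]))    # grows roughly like n log n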
Claim 6.1.1 The (amortized) running time of constructing Bi from Bi−1 is proportional to the size of the conflict lists of the vertical
trapezoids in Bi \ Bi−1 (and the number of such new trapezoids).
Proof: We charge all the work involved in the ith iteration either to the conflict lists of the newly created trapezoids or to the deleted
conflict lists. Clearly, the running time of the algorithm in the ith iteration is linear in the total size of these conflict lists. Observe
that every conflict gets charged twice – when it is being created and when it is being deleted. As such, the (amortized) running time
in the ith iteration is proportional to the total length of the newly created conflict lists.
Thus, to bound the running time of the algorithm, it is enough to bound the expected size of the destroyed conflict lists in the ith
iteration (and sum this bound over the n iterations carried out by the algorithm), or, alternatively, to bound the expected size of the
conflict lists created in the ith iteration.
Lemma 6.1.2 Let S be a set of n segments (in general position°) with k intersection points. Let Si be the first i segments in a
random permutation of S. The expected size of Bi = A|(Si), denoted by τ(i) (i.e., the number of trapezoids in Bi), is O( i + k(i/n)² ).
Proof: Consider an intersection point p = s ∩ s′, where s, s′ ∈ S. The probability that p is present in A|(Si) is the probability that
both s and s′ are in Si. This probability is
α = C(n − 2, i − 2) / C(n, i) = [ (n − 2)! / ((i − 2)! (n − i)!) ] · [ i! (n − i)! / n! ] = i(i − 1) / (n(n − 1)).
°
In this case, no two intersection points are the same, no two intersection points (or vertices) have the same x-coordinate, no
two segments lie on the same line, etc. Making geometric algorithms work correctly for all degenerate inputs is a huge pain that can
usually be handled by tedious and careful handling. Thus, we will always assume general position of the input. In other words, in
theory all geometric inputs are inherently good, while in practice they are all evil (as anybody that tried to implement geometric
algorithms can testify). The reader is encouraged not to use this to draw any conclusions on the human condition.
The proof is provided in excruciating detail to get the reader used to this kind of argumentation. I would apologize for this
pain, but it is a minor trifle, not to be mentioned, when compared to the other crimes in this manuscript.
In particular, the expected number of intersection points of A(S) that appear in A(Si) is Σ_{p ∈ V} α = α k, where V is the set of k intersection points of A(S). Since every segment of Si contributes its two endpoints to
the arrangement A(Si), we have that the expected number of vertices in A(Si) is
2i + ( i(i − 1) / (n(n − 1)) ) k.
Now, the number of trapezoids in A|(Si) is proportional to the number of vertices of A(Si), which implies the claim.
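The probability computed in the proof is easy to confirm empirically; the short check below (added for illustration) compares the empirical frequency with i(i − 1)/(n(n − 1)) and with the binomial expression it was derived from.

    import random
    from math import comb

    def pair_in_sample_prob(n, i, trials=100_000, seed=0):
        """Empirical probability that two fixed elements both land in a random i-subset of n."""
        rng = random.Random(seed)
        universe = list(range(n))
        hits = 0
        for _ in range(trials):
            sample = set(rng.sample(universe, i))
            hits += (0 in sample and 1 in sample)
        return hits / trials

    n, i = 50, 10
    print(pair_in_sample_prob(n, i))              # empirical estimate
    print(i * (i - 1) / (n * (n - 1)))            # i(i-1)/(n(n-1)) from the proof
    print(comb(n - 2, i - 2) / comb(n, i))        # the binomial form, same value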
As such, the expected overall size of the conflict lists created in the ith iteration is
E[ Ci | Bi ] ≤ Σ_{σ ∈ Bi} (4/i) |cl(σ)| = (4/i) Wi,
where Ci denotes the total size of the conflict lists created in the ith iteration, Wi = Σ_{σ ∈ Bi} |cl(σ)|, and the inequality holds since (by backward analysis) a trapezoid σ ∈ Bi was created in the ith iteration only if si is one of the at most four segments defining it, which happens with probability at most 4/i.
By Lemma 6.1.2, the expected size of Bi is O(i + ki²/n²). Let us guess (for the time being) that on average the size of the conflict
list of a trapezoid of Bi is about O(n/i). In particular, assume that we know that
E[Wi] = O( ( i + (i²/n²) k ) · (n/i) ) = O( n + k i/n ),
by Lemma 6.1.2. Implying
E[Ci] = E[ E[Ci | Bi] ] ≤ E[ (4/i) Wi ] = (4/i) E[Wi] = O( (4/i)( n + ki/n ) ) = O( n/i + k/n ).
In particular, the expected amount of work in the ith iteration is proportional to E[Ci]. Thus, the overall expected running time of
the algorithm is
E[ Σ_{i=1}^n Ci ] = O( Σ_{i=1}^n ( n/i + k/n ) ) = O( n log n + k ).
Theorem 6.1.3 Given a set S of n segments in the plane with k intersections, one can compute the vertical decomposition of A(S)
in expected O(n log n + k) time.
Intuition and discussion. What remains to be seen, is how we came up with the guess that the average size of a conflict-list
of a trapezoid of Bi is about O(n/i). Note, that ε-nets imply that the bound O((n/i) log i) holds with constant confidence (see
Theorem 5.3.4), so this result is only slightly surprising. To prove this, we present in the next section a “strengthening” of ε-nets to
geometric settings.
To get an intuition how we came up with this guess, consider a set P of n points on the line, and a random sample R of i points
from P. Let Î be the partition of the real line into maximal open intervals, induced by the points of R, that do not contain points of R in their
interior.
Consider an interval (i.e., a one-dimensional trapezoid) of Î. It is intuitively clear that this interval (in expectation) would
contain O(n/i) points. Indeed, fix a point x on the real line, and imagine that we pick each point of P into the random
sample with probability i/n. The random variable which is the number of points of P we have to scan to the right of x till we “hit” a point that is in the
random sample behaves like a geometric variable with probability i/n, and as such its expected value is n/i. The same argument
works if we scan P to the left of x. We conclude that the number of points of P in the interval of Î that contains x, but does not
contain any point of R, is O(n/i) in expectation.
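This one-dimensional intuition is also easy to test numerically. The sketch below (an added illustration) samples i out of n random points and counts how many points of P fall in the gap of the sample containing x = 1/2; since the gap extends to both sides of x, the answer concentrates around 2n/i, i.e., Θ(n/i) as claimed.

    import random

    def avg_gap_size(n, i, x=0.5, trials=2000, seed=0):
        """Average number of points of P in the interval of the sample R containing x."""
        rng = random.Random(seed)
        total = 0
        for _ in range(trials):
            P = [rng.random() for _ in range(n)]
            R = rng.sample(P, i)
            lo = max((p for p in R if p <= x), default=float("-inf"))
            hi = min((p for p in R if p > x), default=float("inf"))
            total += sum(1 for p in P if lo < p < hi)
        return total / trials

    n, i = 1000, 50
    print(avg_gap_size(n, i), "vs n/i =", n / i)   # about 2n/i, i.e. Theta(n/i)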
Of course, the vertical decomposition case is more involved. Instead of proving the required result for this case, we will prove
a more general result which can be applied in a lot of other settings.
6.2 General Settings
Let S be a set of objects. For a subset R ⊆ S, we define a collection of ‘regions’ called F (R). For vertical decomposition of
segments (i.e., Theorem 6.1.3), the objects are segments, the regions are trapezoids, and F (R) is the set of vertical trapezoids
forming A| (R). Let [
T = T (S) = F ( R)
R⊆S
denote the set of all possible regions defined by subsets of S. We associate two subsets D(σ), K(σ) ⊆ S with each region σ ∈ T .
®
The defining set D(σ) of σ is a subset of S defining the region σ (the precise a
requirements from this set are specified in the axioms below). We assume that for
c
every σ ∈ T , |D(σ)| ≤ d for a (small) constant d. The constant d is sometime referred
to as the combinatorial dimension. In the case of Theorem 6.1.3, each trapezoid σ e
is defined by at most 4 segments (or lines) of S that define the region covered by the σ
trapezoid σ, and this set of segments is D(σ). See figure on the right.
The killing set K(σ) of σ is the set of objects of S such that including any object d b
of K(∆) into R prevents σ from appearing in F (R) (i.e., the killing set is the conflict
f
list of σ, if σ is being created by the RIC algorithm). In many applications K(σ) is
just the set of objects intersecting the cell σ; this is also the case in Theorem 6.1.3, Figure 6.1: D(σ) = {b, c, d, e}
where K(σ) is the set of segments of S intersecting the interior of the trapezoid σ, see
Figure 6.1. The weight of σ is ω(σ) = |K(σ)|.
and K(σ) = { f }.
Let S, F (R) , D(σ), and K(σ) be such that for any subset R ⊆ S, the set F (R) satisfies the following axioms:
(i) For any σ ∈ F (R), it holds D(σ) ⊆ R and R ∩ K(σ) = ∅.
(ii) If D(σ) ⊆ R and K(σ) ∩ R = ∅, then σ ∈ F (R).
For any natural number r and a number t > 0, consider R to be a random sample of size r from S, and let
Ft(R) = { σ ∈ F(R) | ω(σ) ≥ t · n/r }.
This is the set of regions in F(R) with a weight that is t times larger than what we expect.¯ We intuitively expect the size of this
set to drop fast as t increases. So, let
#(r) = E[ |F(R)| ]   and   #t(r) = E[ |Ft(R)| ],
where the expectation is over random subsets R ⊆ S of size r. Note that #(r) = #0(r) is just the expected number of regions defined by a
random sample of size r. In words, #t(r) is the expected number of regions in a structure created by r random objects, such that
these regions have weight which is t times larger than the “expected” weight of n/r.
Let
Tt(r) = ∪_{R ⊆ S, |R| = r} Ft(R)
denote the set of all t-heavy regions that might be created by a sample of size r.
In the following, S is a set of n objects complying with Axioms (i) and (ii), and d = max_{σ ∈ T(S)} |D(σ)|.
Lemma 6.2.1 Let r ≤ n and t be parameters, such that 1 ≤ t ≤ r/d. Furthermore, let R be a sample of size r and R′ a sample of size
r′ = ⌊r/t⌋, both from S. Let σ ∈ T be a trapezoid with weight ω = ω(σ) ≥ t(n/r). Then, Pr[σ ∈ F(R)] = O( t^d 2^{−t} Pr[σ ∈ F(R′)] ).
Intuitively (but not quite correctly), Lemma 6.2.1 states that the probability that a t-heavy trapezoid is created drops exponentially with t.
An Almost Proof of Lemma 6.2.1. We provide a back of the envelope argument “proving” Lemma 6.2.1. A more formal proof is
provided in Section 6.5.
Let us pick R (resp., R′) by picking each element of S with probability p = r/n (resp., p′ = r′/n). Note that this sampling
is different from the one used by the lemma, but it provides samples having roughly the same size, and we expect the relevant
®
Paraphrasing Voltaire, this does not imply that a member of T lives in the best of all possible sets.
¯
These are the regions that are t times overweight. Speak about an obesity problem.
probabilities to remain roughly the same. Let δ = |D(σ)| and ω = ω(σ). We have that Pr[σ ∈ F(R)] = p^δ (1 − p)^ω (i.e., this is the
probability that we pick the elements of D(σ) into the sample and do not pick any of the elements of K(σ), which is
by Axiom (ii) exactly the event σ ∈ F(R)). Similarly, Pr[σ ∈ F(R′)] = p′^δ (1 − p′)^ω. As such,
α = Pr[σ ∈ F(R)] / Pr[σ ∈ F(R′)] = [ p^δ (1 − p)^ω ] / [ p′^δ (1 − p′)^ω ] = (r/r′)^δ ( (1 − r/n)/(1 − r′/n) )^ω ≤ (t + 1)^d ( (n − r)/(n − r′) )^ω.
Now,
( (n − r)/(n − r′) )^ω = ( 1 + (r′ − r)/(n − r′) )^ω ≤ ( 1 + (r/t − r)/(n − r′) )^ω ≤ ( 1 − (1 − 1/t)(r/n) )^ω ≤ exp( −(1 − 1/t)(r/n) ω )
≤ exp( −(1 − 1/t)(r/n) t(n/r) ) = exp( −(t − 1) ),
implying that α = O( t^d 2^{−t} ), as claimed.
Since the formal proof is less enlightening than the above “almost proof”, we delegate it to the end of the chapter, see Sec-
tion 6.5. The following exponential decay lemma testifies that truly heavy regions are (exponentially) rare.
Lemma 6.2.2 Given a set S of n objects, let r ≤ n and t be parameters, such that 1 ≤ t ≤ r/d, where d = max_{σ ∈ T(S)} |D(σ)|.
Assuming that Axioms (i) and (ii) above hold for any subset of S, we have
#t(r) = O( t^d 2^{−t} #(r/t) ) = O( t^d 2^{−t} #(r) ).    (6.1)
Proof: Let R be a random sample of size r from S and R′ a random sample of size r′ = ⌊r/t⌋ from S. Let Tt = Tt(r). We
have
#t(r) = E[ |Ft(R)| ] = Σ_{σ ∈ Tt} Pr[σ ∈ F(R)] = Σ_{σ ∈ Tt} O( t^d 2^{−t} Pr[σ ∈ F(R′)] )
≤ O( t^d 2^{−t} ) Σ_{σ ∈ T} Pr[σ ∈ F(R′)] = O( t^d 2^{−t} #(r′) ),
by Lemma 6.2.1.
Theorem 6.2.3 Let R ⊆ S be a random subset of size r. Let #(r) = E[ |F(R)| ] and let c ≥ 1 be an arbitrary constant. Then,
E[ Σ_{σ ∈ F(R)} (ω(σ))^c ] = O( (n/r)^c #(r) ).
6.3 Applications
6.3.1 Analyzing the RIC Algorithm for Vertical Decomposition
As shown in Lemma 6.1.2, #(i) = O( i + k(i/n)² ). Thus, by Theorem 6.2.3, we have that
E[ Σ_{σ ∈ Bi} ω(σ) ] = O( (n/i) τ(i) ) = O( n + ki/n ).
6.3.2 Cuttings
Let S be a set of n lines in the plane, and let r be an arbitrary parameter. A (1/r)-cutting of S is a partition of the plane into constant-complexity regions such that each region intersects at most n/r lines of S. It is natural to try and minimize the number of regions in
the cutting, as cuttings are a natural tool for performing divide and conquer.
Consider the range space having S as its ground set, and vertical trapezoids as its ranges (i.e., given a vertical trapezoid σ, its
corresponding range is the set of all lines of S that intersect the interior of σ). This range space has a VC-dimension which is a
constant, as can be easily verified. Let X ⊆ S be an ε-net for this range space, for ε = 1/r. By Theorem 5.3.4 (ε-net theorem), there
exists such an ε-net X, of this range space, of size O((1/ε) log(1/ε)) = O(r log r). Consider a vertical trapezoid σ in the arrangement
A|(X). It does not intersect any of the lines of X in its interior, and X is an ε-net for S. It follows that σ intersects at most εn = n/r
lines of S in its interior. Since the arrangement A|(X) has complexity O(|X|²), we get the following result.
Lemma 6.3.1 There exists a (1/r)-cutting of a set of segments in the plane of size O( (r log r)² ).
Since an arrangement of n lines has at most C(n, 2) intersection points, and the number of intersections of the lines intersecting a single
region in the cutting is at most C(n/r, 2), this implies that any cutting must be of size Ω(r²). We can get cuttings of such size easily using
the moments technique.
Theorem 6.3.2 Let S be a set of n lines in the plane, and let r be a parameter. One can compute a (1/r)-cutting of S of size O(r2 ).
Proof: Let R be a random sample of size r, and consider its vertical decomposition A|(R). If a vertical trapezoid σ ∈ A|(R)
intersects at most n/r lines of S, then we can add it to the output cutting. The other possibility is that σ intersects t(n/r) lines of
S, for some t > 1; let cl(σ) ⊂ S be the conflict list of σ (i.e., the list of lines of S that intersect the interior of σ). Clearly, a
(1/t)-cutting for the set cl(σ) forms a vertical decomposition (clipped inside σ) such that each trapezoid in this cutting intersects at
most n/r lines of S. Thus, we compute such a cutting inside each such “heavy” trapezoid using the algorithm of Lemma 6.3.1, and
add these subtrapezoids to the resulting cutting. Clearly, the size of the resulting cutting inside σ is O( t² log² t ) = O( t⁴ ). The resulting
two-level partition is clearly the required cutting. By Theorem 6.2.3, the expected size of the cutting is
O( #(r) ) + E[ Σ_{σ ∈ F(R)} O( ( ω(σ) / (n/r) )⁴ ) ] = O( #(r) ) + O( (r/n)⁴ ) · E[ Σ_{σ ∈ F(R)} (ω(σ))⁴ ]
= O( #(r) + (r/n)⁴ · (n/r)⁴ · #(r) ) = O( #(r) ) = O( r² ),
6.4 Bibliographical notes

The technique also applies in more general settings, for example to the vertical decomposition of a single face in an arrangement. Here an insertion of a faraway segment
into the random sample might cut off a portion of the face of interest. In particular, in the settings of Agarwal et al., Axiom (ii) is replaced by
Interestingly, Clarkson [Cla88] did not prove Theorem 6.2.3 using the exponential decay lemma, but gave a direct proof.
His proof, however, implicitly contains the exponential decay lemma. We chose the current exposition since it is technically only
slightly more challenging but provides a better intuition of what is really going on.
The exponential decay lemma (Lemma 6.2.2), was proved by Chazelle and Friedman [CF90]. The work of [AMS94] is a
further extension of this result. Another analysis was provided by Clarkson et al. [CMS93].
Another way to reach similar results, is using the technique of Mulmuley [Mul94a], which relies on a direct analysis on
‘stoppers’ and ‘triggers’. This technique is somewhat less convenient to use but is applicable to some settings where the moments
technique does not apply directly. Mulmuley came up with randomized incremental construction of vertical decomposition. Also,
his concept of the omega function might explain why randomized incremental algorithms perform better in practice than their worst
case analysis [Mul94b].
Backwards analysis in geometric settings was first used by Chew [Che86], and formalized by Seidel [Sei93]. It is similar to the
“leave one out” argument used in statistics for cross validation. The basic idea was probably known to the Greeks (or Russians or
French) at some point in time.
(Naturally, our summary of the development is cursory at best and not necessarily accurate, and all possible disclaimers apply.
A good summary is provided in the introduction of [Sei93].)
Sampling model. Our “almost” proof of Lemma 6.2.1 used a different sampling than the one used by the algorithm (i.e.,
sampling without replacement). Furthermore, Clarkson [Cla88] used random sampling with replacement. As a rule of thumb, all
these sampling approaches are similar and yield similar results, and it is a good idea to use whichever sampling scheme is the easiest
to analyze when figuring out what is going on. Of course, a formal proof requires analyzing the algorithm in the sampling model it
actually uses.
Lazy randomized incremental construction. If one wants to compute a single face that contains a marked point in
an arrangement of curves, then the problem in using randomized incremental construction is that as you add curves, the region of
interest shrinks, and regions that were maintained should be ignored. One option is to perform flooding in the vertical decomposition
to figure out which trapezoids are still reachable from the marked point, and to maintain only these trapezoids in the conflict graph.
Doing it in each iteration is way too expensive, but luckily one can use a lazy strategy that performs this cleanup only a logarithmic
number of times (i.e., you perform a cleanup in an iteration if the iteration number is, say, a power of 2). This strategy complicates
the analysis a bit; see [dBDS95] for more details on this lazy randomized incremental construction technique. An alternative
technique was suggested by the author for the (more restricted) case of planar arrangements, see [Har00b]. The idea is to compute
only what the algorithm really needs in order to compute the output, by computing the vertical decomposition in an exploratory online
fashion. The details are unfortunately overwhelming, although the algorithm seems to perform quite well in practice.
Cuttings. The concept of cuttings was introduced by Clarkson. The first optimal-size cuttings were constructed by Chazelle and
Friedman [CF90], who proved the exponential decay lemma to this end. Our elegant proof follows the presentation by de Berg
and Schwarzkopf [dBS95]. The problem with this approach is that the constants involved in the cutting size are awful.° Matoušek
[Mat98] showed that there are (1/r)-cuttings with 8r² + 6r + 4 trapezoids, by using level approximation. A different approach was
taken by the author [Har00a], who showed how to get cuttings which seem to be quite small (i.e., constant-wise) in practice. The
basic idea is to do randomized incremental construction, but at each iteration greedily add all the trapezoids with a small enough conflict list to the output cutting. One can prove that this algorithm also generates O(r²) cuttings, but the details are not trivial, as the
framework described in this chapter is not applicable for analyzing this algorithm.
Cuttings can also be computed in higher dimensions for hyperplanes, and in the plane for well-behaved curves, see [SA95].
Even more on randomized algorithms in geometry. We have only scratched the surface of this fascinating topic,
which is one of the cornerstones of “modern” computational geometry. The interested reader should have a look at the books by
Mulmuley [Mul94a], Sharir and Agarwal [SA95], Matoušek [Mat02] and Boissonnat and Yvinec [BY98].
°
This is why all computations related to cuttings should be done on waiter’s bill pad. As Douglas Adams put it: “On a waiter’s
bill pad, reality and unreality collide on such a fundamental level that each becomes the other and anything is possible, within
certain parameters.”
6.5 Proof of Lemma 6.2.1
Proof of Lemma 6.2.1: Let Eσ be the event that D(σ) ⊆ R and K(σ) ∩ R = ∅. Similarly, let E′σ be the event that D(σ) ⊆ R′ and
K(σ) ∩ R′ = ∅. By the axioms, we have that σ ∈ F(R) (resp., σ ∈ F(R′)) if and only if Eσ (resp., E′σ) happens.
The proof of this lemma is somewhat tedious and follows by careful calculations. Let δ = |D(σ)| ≤ d and ω = ω(σ), and for two
non-negative integers a ≤ x, let (x)_a = x(x − 1) · · · (x − a + 1) denote the falling factorial. Then
Pr[Eσ] / Pr[E′σ] = [ C(n − ω − δ, r − δ) / C(n, r) ] · [ C(n, r′) / C(n − ω − δ, r′ − δ) ]
= [ (n − r)! r! / ( (n − r′)! r′! ) ] · [ (n − ω − r′)! (r′ − δ)! / ( (n − ω − r)! (r − δ)! ) ]
= [ (r)_δ / (r′)_δ ] · [ (n − ω − r′)_{r−r′} / (n − r′)_{r−r′} ]
≤ [ (r)_d / (r′)_d ] · [ (n − ω − r′)_{r−r′} / (n − r′)_{r−r′} ].
By our assumption r′ = ⌊r/t⌋ ≥ d, so we obtain
(r)_d / (r′)_d ≤ ( (r − d + 1)/(r′ − d + 1) )^d ≤ ( r / (r′/d) )^d = O( ((t + 1)d)^d ) = O( t^d ),
since r′ − d + 1 ≥ r′/d and d is a constant. To bound the second factor, we observe that each of its r − r′ factors is at most
(n − ω − r′)/(n − r′), and therefore
(n − ω − r′)_{r−r′} / (n − r′)_{r−r′} ≤ ( (n − ω − r′)/(n − r′) )^{r−r′} = ( 1 − ω/(n − r′) )^{r−r′} ≤ exp( −ω(r − r′)/(n − r′) ) ≤ exp( −ω(r − r′)/n ).
Since ω ≥ t(n/r), we have ω/n ≥ t/r, and since r − r′ ≥ r(1 − 1/t), we therefore get
Pr[Eσ] / Pr[E′σ] = O( t^d exp( −ω(r − r′)/n ) ) = O( t^d exp( −(t/r) · r(1 − 1/t) ) ) = O( t^d ) exp( −(t − 1) ) = O( t^d 2^{−t} ),
as desired.
Chapter 7
“Maybe the Nazis told the truth about us. Maybe the Nazis were the truth. We shouldn’t forget that: perhaps they
were the truth. The rest, just beautiful lies. We’ve sung many beautiful lies about ourselves. Perhaps that’s what I’m
trying to do - to sing another beautiful lie.”
– –The roots of heaven, Romain Gary
In this chapter, we introduce a “trivial” but yet powerful idea. Given a set S of objects, consider a point p that is contained in some
of the objects, and let its weight be the number of objects that contain it. We can estimate the depth/weight of p by counting the
number of objects that contain it in a random sample of the objects. In fact, by considering points induced by the sample, we can
bound the number of “light” vertices induced by S. This idea can be extended to bounding the number of “light” configurations
induced by a set of objects.
This approach leads to a sequence of short, beautiful, elegant and correct± proofs of several hallmark results in discrete
geometry.
While the results in this chapter are not directly related to approximation algorithms, the insights and general approach would
be useful for us later, or so one hopes.
± The saying goes that “hard theorems have short, elegant and incorrect proofs”. This chapter can maybe serve as a counterexample.
since j ≤ k and 1 − x ≥ e^{−2x}, for 0 < x < 1/2.
On the other hand, the number of vertices on the 0-level of R is |R| − 1. As such,
Σ_{p ∈ V≤k} Xp ≤ |R| − 1.
Putting these two inequalities together, we get that |V≤k| / (e² k²) ≤ n/k. Namely, |V≤k| ≤ e² n k.
The connection to depth is simple. Every line defines a halfplane (i.e., the region above the line). A vertex of depth at most k
is contained in at most k halfplanes. The above proof (intuitively) first observed that there are at most n/k vertices of the random
sample of zero depth (i.e., on the 0-level), and then showed that every vertex of level at most k has probability (roughly) 1/k² to have depth zero in the
random sample. It thus follows that if the number of vertices of level at most k is µ, then µ/k² ≤ n/k; namely, µ = O(nk).
Theorem 7.2.1 (Euler’s formula.) For a connected planar graph G, we have f − e + v = 2, where f, e, v are the number of faces,
edges and vertices in a planar drawing of G.
Lemma 7.2.2 A simple planar graph G with v ≥ 3 vertices has at most e ≤ 3v − 6 edges.
Proof: We assume that the number of edges of G is maximal (i.e., no edge can be added without introducing a crossing); if it
is not maximal, we add edges until it becomes maximal, which only helps. This implies that G is a triangulation (i.e., every face is a triangle). Then
every face is adjacent to three edges, and every edge is adjacent to two faces, and as such 2e = 3f. By Euler’s formula, we have f − e + v = (2/3)e − e + v = 2. Namely,
−e + 3v = 6; alternatively, e = 3v − 6. If e was not maximal to begin with, this equality deteriorates to the required inequality.
For example, the above inequality implies that the complete graph over 5 vertices (i.e., K5) is not planar. Indeed, it has
e = C(5, 2) = 10 edges and v = 5 vertices, but if it were planar, the above inequality would imply that 10 = e ≤ 3v − 6 = 9, which is of
course false. (The reader can amuse herself by trying to prove that K3,3, the bipartite complete graph with 3 vertices on each side,
is not planar.)
Kuratowski’s celebrated theorem states that a graph is planar if and only if it does not contain either K5 or K3,3 inside
it (formally, if it does not have K5 or K3,3 as a minor).
For a graph G, we define the crossing number of G, denoted by c(G), as the minimal number of edge crossings in any drawing
of G in the plane. For a planar graph c(G) is zero, and it is “larger” for “less planar” graphs.
Claim 7.2.3 For a graph G with e edges and v vertices, we have c(G) ≥ e − 3v + 6.
Proof: If e − 3v + 6 ≤ 0 ≤ c(G), then the claim holds trivially. Otherwise, the graph G is not planar by Lemma 7.2.2. Draw G in
such a way that c(G) is realized and assume, for the sake of contradiction, that c(G) < e − 3v + 6. Let H be the graph resulting from
G by removing, from each pair of edges of G that intersect in the drawing, one of the two edges. We have e(H) ≥ e(G) − c(G). But H is
planar (since its drawing has no crossings), and by Lemma 7.2.2 we have e(H) ≤ 3v(H) − 6, or equivalently, e(G) − c(G) ≤ 3v − 6.
Namely, e − 3v + 6 ≤ c(G), which contradicts our assumption.
Lemma 7.2.4 (The crossing lemma.) For a graph G such that e ≥ 4v, we have c(G) = Ω(e³/v²).
Proof: We consider a specific drawing D of G in the plane that has c(G) crossings. Next, let U be a random subset of V selected
by choosing each vertex of V to be in the sample with probability p > 0, where p is a parameter to be determined shortly.
Let H = G_U be the induced subgraph over U. Note that only edges of G with both their endpoints in U “survive” in H.
Thus, the probability of a vertex to survive in H is p, the probability of an edge of G to survive in H is p², and the probability
of a crossing (in this specific drawing D) to survive in the induced drawing D_H (of H) is p⁴. Let Xv and Xe denote the (random
variables which are the) numbers of vertices and edges surviving in H, respectively. Similarly, let Xc be the number of crossings
surviving in D_H. By Claim 7.2.3, we have
Xc ≥ c(H) ≥ Xe − 3Xv + 6.
In particular, this holds in expectation, and as such
p⁴ c(G) = E[Xc] ≥ E[Xe] − 3 E[Xv] + 6 ≥ p² e − 3p v.
Dividing by p⁴ and setting p = 4v/e (which is at most 1 since e ≥ 4v) yields c(G) ≥ e/p² − 3v/p³ = e³/(16v²) − 3e³/(64v²) = e³/(64v²) = Ω(e³/v²).
Lemma 7.2.5 The maximum number of incidences between n points and m lines is $I(n, m) = O\big(n^{2/3} m^{2/3} + n + m\big)$.
Proof: Let P and L be the set of n points and the set of m lines, respectively, realizing I(n, m). Let G be a graph over the points of
P (we assume that P contains an additional point at infinity). We connect two points if they lie consecutively on a common line of
L. Clearly, e = e(G) = I + m and v = v(G) = n + 1, where I = I(n, m). We can interpret the arrangement of lines A(L) as a
drawing of G, where a crossing of two edges of G is just a vertex of A(L). As such, it follows that c(G) ≤ m², since m² is a trivial
bound on the number of vertices of A(L). On the other hand, by Lemma 7.2.4, we have c(G) = Ω(e³/v²). Thus,
$$\frac{(I + m)^3}{(n + 1)^2} = \frac{e^3}{v^2} = O(c(G)) = O(m^2).$$
Assuming I ≥ m and I ≥ n, we have $I = O\big(m^{2/3} n^{2/3}\big)$. Alternatively, $I = O\big(n^{2/3} m^{2/3} + m + n\big)$.
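To spell out the last algebraic step (an elementary manipulation, not spelled out in the text): since $(I + m)^3 \ge I^3$ and $(n + 1)^2 \le 4n^2$, the display above gives
$$\frac{I^3}{4 n^2} \le \frac{(I+m)^3}{(n+1)^2} = O\big(m^2\big),$$
so $I^3 = O\big(m^2 n^2\big)$ and therefore $I = O\big(m^{2/3} n^{2/3}\big)$. When I < m or I < n, the additive m + n term in the final bound dominates.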
Or invented – I have no dog in this argument.
that f₋(y) = k − 1, which implies that the line passing through p with slope y has a point s ∈ P on it, and s is to the right of p.
Clearly, if we continue sweeping, the line would sweep over rp, which implies the claim.
Lemma 7.2.6 also holds by symmetry in the other direction: Between any two edges to the right of p, there is an antipodal
edge on the other side.
Lemma 7.2.7 Let p be a point of P, and let q be a point to its left, such that qp ∈ E(G) and it has the largest slope among all such
edges. Furthermore, assume that there are k − 1 points of P to the right of p. Then, there exists a point r ∈ P, such that pr ∈ E(G)
and pr has larger slope than qp.
Proof: Let α be the slope of qp, and observe that f(α) = k, f₊(α) = k − 1, and f(∞) ≥ k. Namely, there exists y > α such
that f(y) = k. We conclude that there is a k-set edge adjacent to p on the right, with slope larger than α.
So, imagine that we are at an edge e = qp ∈ E(G), where q is to the left of p. We
rotate a line around p (counterclockwise) till we encounter an edge e′ = pr ∈ E(G), where
r is a point to the right of p. We can now walk from e to e′, and continue walking in this
way, forming a chain of edges in G. Note that, by Lemma 7.2.6, no two such chains can
end up using the same edge. Furthermore, by Lemma 7.2.7, such a chain can
end only in the last k − 1 points of P (in their ordering along the x-axis). Namely, we
decomposed the edges of G into k − 1 edge-disjoint convex chains (the chains are convex
since we rotate counterclockwise as we walk along a chain). The picture on the right shows the
5-sets and their decomposition into 4 convex chains.
Lemma 7.2.8 The edges of G can be decomposed into k − 1 convex chains C1 , . . . , Ck−1 .
Similarly, the edges of G can be decomposed into m = n − k + 1 concave chains
D1 , . . . , Dm .
Proof: The first part of the claim is proved above. As for the second claim, rotate the plane
by 180◦ . Every k-set is now (n − k + 2)-set, and by the above argumentation, the edges
of G can be decomposed into n − k + 1 convex chains, which are concave in the original
orientation.
Theorem 7.2.9 The number of k-sets defined by a set of n points in the plane is $O\big(n k^{1/3}\big)$.
Proof: The graph G has n = |P| vertices, and let m = |E(G)| be the number of k-sets. By Lemma 7.2.8, any crossing of two
edges of G is an intersection point of one of the convex chains C1, . . . , C_{k−1} with one of the concave chains D1, . . . , D_{n−k+1}. Since a convex
chain and a concave chain can have at most two intersections, we conclude
that there are at most 2(k − 1)(n − k + 1) crossings in G.
By the Crossing Lemma (Lemma 7.2.4), there are at least Ω(m³/n²) crossings. Putting these two inequalities together, we conclude
that m³/n² = O(nk), which implies $m = O\big(n k^{1/3}\big)$.
Theorem 7.3.1 Let S be a set of n objects as above, with combinatorial dimension d, and let k be a parameter. Let R be a random
sample created by picking each element of S with probability 1/k. Then, we have
$$\bigl|T_{\le k}(S)\bigr| \le c\, \mathbb{E}\bigl[\, k^d f(|R|)\, \bigr],$$
for a constant c.
Proof: We reproduce the proof of Theorem 7.1.1. Every region $\sigma \in T_{\le k}$ appears in $F(R)$ with probability $\ge (1/k)^d (1 - 1/k)^k \ge e^{-2}/k^d$.
As such, $\mathbb{E}\bigl[f(|R|)\bigr] \ge \mathbb{E}\bigl[|F(R)|\bigr] \ge |T_{\le k}| \big/ \bigl(k^d e^2\bigr)$.
Lemma 7.3.2 Let f(·) be a monotone increasing function which is well behaved; namely, there exists a constant c such that
f(xr) ≤ c f(r), for any r and 1 ≤ x ≤ 2. Let Y be the number of heads in n coin-flips where the probability for head is 1/k. Then
$\mathbb{E}[f(Y)] = O(f(n/k))$.
Proof: We have
$$\mathbb{E}[f(Y)] \;\le\; f\bigl(10(n/k)\bigr) + \sum_{t=10}^{k} f\bigl((t+1)(n/k)\bigr)\Pr\bigl[Y \ge t(n/k)\bigr] \;\le\; O\bigl(f(n/k)\bigr) + \sum_{t=10}^{k} c^{\lceil \lg (t+1)\rceil} f(n/k)\, 2^{-t(n/k)} \;=\; O\bigl(f(n/k)\bigr),$$
by the Chernoff inequality and since f(·) is well behaved.
Theorem 7.3.3 Let S be a set of n objects, with combinatorial dimension d, and let k be a parameter. Assume that the number
of regions formed by a set of m objects is bounded by a function f (m), and furthermore, f (m) is well behaved in the sense of
Lemma 7.3.2. Then, $|T_{\le k}(S)| = O\bigl(k^d f(n/k)\bigr)$.
Note, that if the function f (·) grows polynomially then Theorem 7.3.3 applies. It fails if f (·) grows exponentially.
We need the following fact, which we state without proof.
Theorem 7.3.4 (The Upper Bound Theorem.) The complexity of the convex hull of n points in d dimensions is bounded by $O\bigl(n^{\lfloor d/2 \rfloor}\bigr)$.
Example 7.3.5 (At most k-sets.) Let P be a set of n points in IR^d. A region here is a halfspace with d points on its boundary. The
set of regions defined by P is just the set of faces of the convex hull of P. The complexity of the convex hull of n points in d dimensions is
$f(n) = O\bigl(n^{\lfloor d/2\rfloor}\bigr)$, by Theorem 7.3.4. Two halfspaces h, h′ are considered to be combinatorially different if P ∩ h ≠ P ∩ h′. As
such, the number of combinatorially different halfspaces containing at most k points of P is at most $O\bigl(k^d f(n/k)\bigr) = O\bigl(k^{\lceil d/2\rceil} n^{\lfloor d/2\rfloor}\bigr)$.
At most k-level. The technique for bounding the complexity of the at most k-level (or at most depth k) is generally attributed to
Clarkson and Shor [CS89], and more precisely it is from [Cla88]. Previous work on just the two-dimensional variant includes [GP84,
Wel86, AG86]. Our presentation in Section 7.1 and Section 7.3 follows (more or less) Sharir [Sha03]. The connection of this
technique to the crossing lemma is from there.
For a proof of the Upper Bound Theorem (Theorem 7.3.4), see Matoušek [Mat02].
The crossing lemma. The crossing lemma is originally by Ajtai et al. [ACNS82] and Leighton [Lei84]. The current greatly
simplified “proof from the book” is attributed to Sharir. The insight that this lemma has something to do with incidences and similar
problems is due to Székely [Szé97]. Elekes [Ele97] used the crossing lemma to prove surprising lower bounds on sum and products
problems.
The complexity of the k-level and the number of k-sets. This is considered to be one of the hardest problems in discrete
geometry, and there is still a big gap between the best lower bound [Tót01] and the best upper bound currently known [Dey98]. Our
presentation in Section 7.2.2 follows suggestions by Micha Sharir, and is based on the result of Dey [Dey98] (which was in turn
inspired by the work of Agarwal et al. [AACS98]). This problem has a long history, and the reader is referred to Dey [Dey98] for
details.
Incidences. This problem again has a long and painful history. The reader is referred to [Szé97] for details.
We only skimmed the surface of some problems in discrete geometry and results known in this field related to incidences and
k-sets. Good starting points for learning more are the books by Brass et al. [BMP05] and Matoušek [Mat02].
Chapter 8
As far as he personally was concerned there was nothing else for him to do except either shoot himself as soon
as he came home or send for his greatcoat and saber from the general’s apartment, take a bath in the town baths,
stop at Volgruber’s wine-cellar afterwards, put his appetite in order again and book by telephone a ticket for the
performance in the town theater that evening.
– – The good soldier Svejk, Jaroslav Hasek
The data structure. Let R1, . . . , R_M be M independent random samples of S, formed by picking every element with probability 1/z, where
$$M = \nu(\varepsilon) = \bigl\lceil c_2\, \varepsilon^{-2} \log n \bigr\rceil,$$
and c2 is a sufficiently large absolute constant. Build M separate emptiness-query data structures D1, . . . , D_M, for the sets R1, . . . , R_M,
respectively, and put D = D(z, ε) = {D1, . . . , D_M}.
Answering a query. Consider a query range r, and let X_i = 1 if r intersects any of the objects of R_i and X_i = 0 otherwise, for
i = 1, . . . , M. The value of X_i can be determined using a single emptiness query in D_i, for i = 1, . . . , M. Compute $Y_r = \sum_i X_i$.
For a range σ of depth k, the probability that σ intersects one of the objects of R_i is
$$\rho(k) = 1 - \left(1 - \frac{1}{z}\right)^{k}. \tag{8.1}$$
If a range σ has depth z, then Λ = E[Y_σ] = ν(ε)ρ(z). Our data structure returns “depth(r, S) < z” if Y_r < Λ, and “depth(r, S) ≥ z”
otherwise.
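To make the construction concrete, here is a minimal sketch of this decision data structure in Python, assuming an abstract emptiness oracle intersects(objects, r) (a hypothetical primitive standing in for the emptiness-query data structures D_i) and an illustrative value for the constant c2.

import math
import random

class DepthDecider:
    # Decides (with high probability) whether depth(r, S) is below or above z.
    def __init__(self, objects, z, eps, c2=16):
        # M = nu(eps) independent samples, each element kept with probability 1/z.
        self.M = math.ceil(c2 * eps ** -2 * math.log(len(objects)))
        self.samples = [[o for o in objects if random.random() < 1.0 / z]
                        for _ in range(self.M)]
        rho_z = 1.0 - (1.0 - 1.0 / z) ** z        # Eq. (8.1) evaluated at depth z
        self.threshold = self.M * rho_z            # Lambda

    def is_deep(self, r, intersects):
        # Y_r = number of samples containing an object intersecting r;
        # each term costs one emptiness query.
        Y = sum(1 for R in self.samples if intersects(R, r))
        return Y >= self.threshold                 # report "depth(r, S) >= z" iff Y_r >= Lambda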
8.1.1.1 Correctness
In the following, we show that, with high probability, the data structure indeed returns the correct answer if the depth of the query
range is outside the “uncertainty” range [(1 − ε)z, (1 + ε)z]. For simplicity of exposition, we assume in the following that z ≥ 10 (the
case z < 10 follows by similar arguments). Consider a range r of depth at most (1 − ε)z. The data structure returns a wrong answer
if Yr > Λ. We will show that the probability of this event is polynomially small. The other case, where r has depth at least (1 + ε)z
but Yr < Λ is handled in a similar fashion.
Intuition. Before jumping into the murky proof, let us consider the situation. Every sample R_i is an experiment. The experiment
succeeds if the sample contains an object that intersects the query range r. The probability of success is ρ(k), see Eq. (8.1), where
k is the weight of r. Now, if there is a big enough gap between ρ((1 − ε)z) and ρ(z), then we could decide if the range is “heavy”
(i.e., weight exceeding z) or “light” (i.e., weight smaller than (1 − ε)z) by estimating the probability γ that r intersects an object in
the random sample.
Indeed, if r is “light” then γ ≤ ρ((1 − ε)z), and if it is “heavy” then γ ≥ ρ(z). We estimate γ by the quantity Y_r/M; namely, by
repeating the experiment M times, and dividing the number of successes by the number of experiments (i.e., M). Now, we need
to determine how many experiments we need to perform till we get a good estimate, which is reliable enough to carry out our
nefarious task of distinguishing the light case from the heavy case. Clearly, the bigger the gap between ρ((1 − ε)z) and ρ(z), the
fewer experiments are required. Our proof first establishes that the gap between these two probabilities is Ω(ε), and next we
plug this into the Chernoff inequality to figure out how large M has to be for this estimate to be reliable.
Observation 8.1.2 For x ∈ [0, 1/2], it holds exp(−2x) ≤ 1 − x. Similarly, for x ≥ 0, we have 1 − x ≤ exp(−x) and 1 + x ≤ exp(x).
We have
$$\mathbb{E}[Y_r] = M\,\rho\bigl((1-\varepsilon)z\bigr) = M\left(1 - \left(1 - \frac{1}{z}\right)^{(1-\varepsilon)z}\right) \ge M\left(1 - e^{-(1-\varepsilon)}\right) \ge \frac{M}{3},$$
since 1 − 1/z ≤ exp(−1/z) and ε ≤ 1/2. By definition, Λ = M ρ(z); therefore, by Observation 8.1.1, we have
$$\xi = \frac{\Lambda}{\mathbb{E}[Y_r]} = \frac{1 - \left(1 - \frac{1}{z}\right)^{z}}{1 - \left(1 - \frac{1}{z}\right)^{(1-\varepsilon)z}} \ge 1 + \left(1 - \frac{1}{z}\right)^{(1-\varepsilon)z} - \left(1 - \frac{1}{z}\right)^{z} = 1 + \left(1 - \frac{1}{z}\right)^{(1-\varepsilon)z}\left(1 - \left(1 - \frac{1}{z}\right)^{\varepsilon z}\right).$$
Now, by applying Observation 8.1.2 repeatedly, we have
$$\xi \ge 1 + \exp\left(-\frac{2}{z}(1-\varepsilon)z\right)\left(1 - \exp\left(-\frac{1}{z}\varepsilon z\right)\right) \ge 1 + \frac{1}{e^2}\bigl(1 - \exp(-\varepsilon)\bigr) \ge 1 + \frac{1}{e^2}\left(1 - \left(1 - \frac{\varepsilon}{2}\right)\right) \ge 1 + \frac{\varepsilon}{15}.$$
Deploying the Chernoff inequality (Theorem 25.2.6), we have that if µ_r = depth(r, S) = (1 − ε)z then
$$\alpha = \Pr[Y_r > \Lambda] \le \Pr\bigl[Y_r > \xi\, \mathbb{E}[Y_r]\bigr] \le \Pr\bigl[Y_r > (1 + \varepsilon/15)\, \mathbb{E}[Y_r]\bigr] \le \exp\left(-\frac{1}{4}\left(\frac{\varepsilon}{15}\right)^2 \mathbb{E}[Y_r]\right) \le \exp\left(-\frac{M\varepsilon^2}{c_3}\right) \le \exp\left(-\frac{\varepsilon^2 \bigl\lceil c_2\varepsilon^{-2}\log n\bigr\rceil}{c_3}\right) \le n^{-c_4},$$
for appropriate constants c3 and c4 (using E[Y_r] ≥ M/3).
Lemma 8.1.4 Given a set S of n objects, a parameter 0 < ε < 1/2, and z ∈ [0, n], one can construct a data structure D which,
given a range r, returns either “small” or “large”. If it returns “small”, then µ_r ≤ (1 + ε)z, and if it returns “large”, then µ_r ≥ (1 − ε)z. The data structure
might return either answer if µ_r ∈ [(1 − ε)z, (1 + ε)z].
The data structure D consists of M = O(ε⁻² log n) emptiness data structures. The space and preprocessing time needed to
build them are O(S(2n/z)ε⁻² log n), where S(m) is the space (and preprocessing time) needed for a single emptiness data structure
storing m objects.
The query time is O(Q(2n/z)ε⁻² log n), where Q(m) is the time needed for a single query in such a structure. All
bounds hold with high probability.
Proof: The lemma follows immediately from the above discussion. The only missing part is observing that by the Chernoff
inequality we have that |Ri | ≤ 2n/z, and this holds with high probability.
Answering a query. Given a range query r, each data structure in our list returns “small” or “large”. Moreover, with high probability, if
we were to query all the data structures, we would get a sequence of “large”s, followed by a sequence of “small”s. It is easy to verify that the
value associated with the last data structure returning “large” (rounded to the nearest integer) yields the required approximation. We can
use binary search on D(v1), . . . , D(v_W) to locate this changeover
value using a total of O(log W) = O(log(ε⁻¹ log n)) queries in the
structures D1, . . . , D_W. Namely, the overall query time is $O\bigl(Q(n)\varepsilon^{-2}(\log n)\log(\varepsilon^{-1}\log n)\bigr)$.
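A minimal sketch of this query procedure, assuming deciders built (as above) for an increasing sequence of values v_1 < · · · < v_W covering [1, n]; the names are illustrative.

def approximate_count(deciders, values, r, intersects):
    # deciders[i] answers "large" (True) iff depth(r, S) is (approximately) at least values[i];
    # binary search locates the changeover from "large" to "small".
    lo, hi, last_large = 0, len(deciders) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if deciders[mid].is_deep(r, intersects):
            last_large, lo = mid, mid + 1
        else:
            hi = mid - 1
    return round(values[last_large])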
Theorem 8.1.5 Given a set S of n objects, assume that one can construct, using S(n) space, in T(n) time, a data structure
that answers emptiness queries in Q(n) time.
Then, one can construct, using O(S(n)ε⁻³ log² n) space, in O(T(n)ε⁻³ log² n) time, a data structure that, given a range r,
outputs a number α_r with (1 − ε)µ_r ≤ α_r ≤ µ_r. The query time is O(ε⁻² Q(n)(log n) log(ε⁻¹ log n)). The result returned is correct
with high probability for all queries, and the running time bounds hold with high probability.
The bounds of Theorem 8.1.5 can be improved, see Section 8.4 for details.
8.2 Application: halfplane and halfspace range counting
Using the data structure of Dobkin and Kirkpatrick [DK85], one can answer emptiness halfspace range searching queries in loga-
rithmic time. In this case, we have S (n) = O(n), T (n) = O(n log n), and Q(n) = O(log n).
Corollary 8.2.1 Given a set P of n points in two (resp., three) dimensions, and a parameter ε > 0, one can construct in
O(n·poly(1/ε, log n)) time a data structure, of size O(n·poly(1/ε, log n)), such that given a halfplane (resp. halfspace) r, it out-
puts a number α, such that (1 − ε)|r ∩ P| ≤ α ≤ |r ∩ P|, and the query time is O(poly(1/ε, log n)). The result returned is correct
with high probability for all queries.
The standard lifting of points in IR² to the paraboloid in IR³ implies a similar result for approximate range counting for
disks, as a disk range query in the plane reduces to a halfspace range query in three dimensions.
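As a concrete illustration of this lifting (the function names here are illustrative): a point (x, y) maps to (x, y, x² + y²), and a disk constraint becomes a linear inequality in the lifted coordinates.

def lift(p):
    # Lift a planar point onto the paraboloid in three dimensions.
    x, y = p
    return (x, y, x * x + y * y)

def disk_as_halfspace(a, b, rad):
    # (x - a)^2 + (y - b)^2 <= rad^2 rewrites, with w = x^2 + y^2, as
    # -2a*x - 2b*y + w <= rad^2 - a^2 - b^2, a halfspace in (x, y, w)-space.
    return (-2.0 * a, -2.0 * b, 1.0), rad * rad - a * a - b * b

def point_in_disk_via_lift(p, a, b, rad):
    coeffs, offset = disk_as_halfspace(a, b, rad)
    return sum(c * q for c, q in zip(coeffs, lift(p))) <= offset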
Corollary 8.2.2 Given a set P of n points in two dimensions, and a parameter ε, one can construct a data structure in
O(n·poly(1/ε, log n)) time, using O(n·poly(1/ε, log n)) space, such that given a disk r, it outputs a number α, such that (1−ε)|r ∩ P| ≤
α ≤ |r ∩ P|, and the query time is O(poly(1/ε, log n)). The result returned is correct with high probability for all possible queries.
Depth queries. By computing the union of a set of n pseudodisks in the plane, and preprocessing the union for point-location
queries, one can perform “emptiness” queries in this case in logarithmic time. (Again, we are assuming here that we can perform
the geometric primitives on the pseudodisks in constant time.) The space needed is O(n) and it takes O(n log n) time to construct it.
Thus, we get the following result.
Corollary 8.2.3 Given a set S of n pseudodisks in the plane, one can preprocess them in O(nε⁻² log² n) time, using O(nε⁻² log n)
space, such that given a query point q, one can output a number α, such that (1 − ε)depth(q, S) ≤ α ≤ depth(q, S), and the query
time is O(ε⁻² log² n). The result returned is correct with high probability for all possible queries.
Lemma 8.3.1 (Reliable sampling.) Let S be a set of n objects, 0 < ε < 1/2, and let r be a point of depth u ≥ k in S. Let R be a
random sample of S, such that every element is picked into the sample with probability
$$p = \frac{8}{k\varepsilon^2}\ln\frac{1}{\delta}.$$
Let X be the depth of r in R. Then, with probability ≥ 1 − δ, the estimated depth of r, that is X/p, lies in the interval
[(1 − ε)u, (1 + ε)u].
In fact, this estimate succeeds with probability $\ge 1 - \delta^{u/k}$.
Proof: We have that µ = E[X] = pu. As such, by the Chernoff inequality (Theorem 25.2.6 and Theorem 25.2.9), we have
$$\Pr\Bigl[X \notin \bigl[(1-\varepsilon)\mu,\ (1+\varepsilon)\mu\bigr]\Bigr] = \Pr\bigl[X < (1-\varepsilon)\mu\bigr] + \Pr\bigl[X > (1+\varepsilon)\mu\bigr] \le \exp\bigl(-p u \varepsilon^2/2\bigr) + \exp\bigl(-p u \varepsilon^2/4\bigr) \le \exp\left(-4\ln\frac{1}{\delta}\right) + \exp\left(-2\ln\frac{1}{\delta}\right) \le \delta,$$
since u ≥ k.
Note that if the depth of r in S is (say) u ≤ 10k, then the depth of r in the sample is (with the stated probability)
$$\mathrm{depth}(r, R) \le (1 + \varepsilon)p u = O\left(\frac{1}{\varepsilon^2}\ln\frac{1}{\delta}\right),$$
which is (relatively) a small number. Namely, via sampling, we turned the task of estimating the depth of heavy ranges into the
task of estimating the depth of a shallow range. To see why this is true, observe that we can perform a binary (exponential) search
for the depth of r by a sequence of coarser to finer samples.
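A minimal sketch of this estimator (Lemma 8.3.1), assuming a primitive contains_r(o) that tests whether the object o contains the query point r; the names are illustrative.

import math
import random

def estimate_depth(objects, contains_r, k, eps, delta):
    # Sample each object independently with probability p, count the sampled
    # objects containing r, and scale back by 1/p.
    p = min(1.0, 8.0 / (k * eps ** 2) * math.log(1.0 / delta))
    X = sum(1 for o in objects if random.random() < p and contains_r(o))
    return X / p        # lies in [(1-eps)u, (1+eps)u] with probability >= 1 - delta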
8.4 Bibliographical notes
The presentation here follows the work by Aronov and Har-Peled [AH05]. The basic idea is folklore and predates this paper, but the
formal connection between approximate counting and emptiness is from this paper. One can improve the efficiency of this reduction
by being more careful; see the full version of [AH05] for details. Follow-ups to this work include [KS06, AC07, AHS07].
Chapter 9
At the sight of the still intact city, he remembered his great international precursors and set the whole place on fire
with his artillery in order that those who came after him might work off their excess energies in rebuilding.
– – The tin drum, Gunter Grass
In this chapter, we briefly describe (and analyze) a simple randomized algorithm for linear programming in low dimensions.
Next, we show how to extend this algorithm to solve linear programming with violations. Finally, we show how one can
efficiently approximate the number of constraints one needs to violate to make a linear program feasible. This serves as fruitful
ground to demonstrate some of the techniques we have visited already.
Our discussion is going to be somewhat intuitive. We will fill in the details, and formally prove the correctness of our algorithms,
in the next chapter.
9.1.1 A solution, and how to verify it
Observe that an optimal solution of a LP is either a vertex or unbounded. Indeed, if the optimal solution p lies in the middle of
a segment s, such that s is feasible, then either one of its endpoints provides a better solution (i.e., one of them is lower in the x_d
direction than p), or both endpoints of s have the same target value. But then, we can move the solution to one of the endpoints of
s. In particular, if the solution lies on a k-dimensional facet F of the boundary of the feasible polyhedron (i.e., formally, F is a set
with affine dimension k formed by intersecting the boundary of the polyhedron with a hyperplane), we can move it so that it lies on a
(k − 1)-dimensional facet F′ of the feasible polyhedron, using the preceding argumentation. Using it repeatedly, one ends up in a
vertex of the polyhedron, or in an unbounded solution.
Thus, given an instance of LP, the LP solver should output one of the following answers.
(A) Finite. The optimal solution is finite, and the solver provides a vertex which realizes the optimal solution.
(B) Unbounded. The given LP has an unbounded solution. In this case, the solver outputs a ray ζ, such that ζ lies inside
the feasible region and points downward in the negative x_d-axis direction.
(C) Infeasible. The given LP does not have any point which complies with all the given inequalities. In this case the solver
outputs d + 1 constraints which are infeasible on their own.
Lemma 9.1.1 Given a set of d linear inequalities in IRd , one can compute the vertex formed by the intersection of their boundaries
in O(d3 ) time.
Proof: Write down the system of equalities that the vertex must fulfill. It is a system of d equalities in d variables, and it can be solved
in O(d³) time using Gaussian elimination.
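A minimal sketch of this computation, assuming the d constraints are given as the rows of a matrix A with right-hand sides b (the names here are illustrative; numpy's solver carries out the Gaussian elimination).

import numpy as np

def basis_vertex(A, b):
    # A: (d, d) array of constraint normals a_i; b: (d,) right-hand sides.
    # The vertex is the unique solution of a_i . x = b_i for all i (assuming A is invertible).
    return np.linalg.solve(A, b)          # O(d^3) time

# Example: in the plane, x <= 1 and y <= 2 meet at the vertex (1, 2):
#   basis_vertex(np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([1.0, 2.0]))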
A cone is the intersection of d constraints, where its apex is the vertex associated with this set of constraints. A set of such d
constraints is a basis. An intersection of d − 1 of the hyperplanes of a basis forms a line, and clipping this line to the cone of the basis
forms a ray. Clipping the same line to the feasible region would yield either a segment, referred to as an edge of the polytope, or a
ray. An edge of the polyhedron connects two vertices of the polyhedron. As such, one can think about the boundary of the feasible
region as inducing a graph – its vertices are the vertices of the polyhedron, and its edges are the edges of the polyhedron. Since every vertex has
d hyperplanes defining it (its basis), and an adjacent edge is defined by d − 1 of these hyperplanes, it follows that each vertex has
$\binom{d}{d-1} = d$ edges adjacent to it.
The following lemma tells us when we have an optimal vertex. While it is intuitively clear, its proof requires a systematic
understanding of what the feasible region of a linear program looks like, and we delegate it to the next chapter.
Lemma 9.1.2 Let L be a given linear program, and let P denote its feasible region. Let v be a vertex of P such that all the d rays
emanating from v point upward in the x_d-axis direction. Then v is the lowest (in the x_d-axis direction) point in P and it is thus the
optimal solution to L.
Interestingly, when we are at a vertex v of the feasible region, it is easy to find the adjacent vertices. Indeed, compute the d
rays emanating from v. For each such ray, intersect it with all the constraints of the LP. The closest intersection point along this ray is
the vertex u of the feasible region adjacent to v. Doing this naively takes $O\bigl(dn + d^{O(1)}\bigr)$ time.
Lemma 9.1.2 offers a simple algorithm for computing the optimal solution for an LP. Start from a feasible vertex of the LP.
As long as this vertex has at least one ray that points downward, follow this ray to the adjacent vertex on the feasible polytope that is
lower than the current vertex (i.e., compute the d rays emanating from the current vertex, and follow one of the rays that points
downward, till you hit a new vertex). Repeat this till the current vertex has all its rays pointing upward; by Lemma 9.1.2 this is the
optimal solution. Up to tedious (and non-trivial) details, this is the simplex algorithm.
We also need the following lemma, whose proof is delegated to the next chapter.
Lemma 9.1.3 If L is a LP in d dimensions which is not feasible, then there exist d + 1 inequalities in L which are infeasible on
their own.
Note that given a set of d + 1 inequalities, it is easy to verify whether it is feasible or not. Indeed, compute the $\binom{d+1}{d} = d + 1$ vertices formed by
this set of constraints, and check whether any of these vertices is feasible. If all of them are infeasible, then this set of constraints is
infeasible.
We remind the reader that the input to the algorithm is the LP L which is defined by a set of n linear inequalities in IRd . We
are looking for the lowest point in IRd which is feasible for L.
A vertex v is acceptable if all the d rays associated with it point upward (note that the vertex itself might not be feasible).
The optimal solution (if it is finite) must be located at an acceptable vertex. Assume that we are given the basis B = {h1, . . . , h_d} of
such an acceptable vertex. Let h_{d+1}, . . . , h_m be a random permutation of the remaining constraints of the LP L.
Our algorithm is randomized incremental. At the ith step, for i ≥ d, it maintains the optimal solution for the first i
constraints. As such, in the ith step, the algorithm checks whether the optimal solution v_{i−1} of the previous iteration is still feasible
with the new constraint h_i (namely, the algorithm checks if v_{i−1} is inside the halfspace defined by h_i). If v_{i−1} is still feasible, then it is
still the optimal solution, and we set v_i ← v_{i−1}.
The more interesting case is when v_{i−1} ∉ h_i. First, we check if the basis of v_{i−1} together with h_i forms a set of constraints which
is infeasible. If so, the given LP is infeasible, and we output B(v_{i−1}) ∪ {h_i} as our proof of infeasibility.
Otherwise, the new optimal solution must lie on the hyperplane associated with h_i. As such, we recursively compute the lowest
vertex in the (d − 1)-dimensional polyhedron $(\partial h_i) \cap \bigcap_{j=1}^{i-1} h_j$. This is a linear program involving i − 1 constraints, and it involves
d − 1 variables, since it lies on the (d − 1)-dimensional hyperplane ∂h_i. The solution found, v_i, is defined by a basis of d − 1 constraints,
and adding h_i to it results in an acceptable vertex that is feasible, and we continue to the next iteration.
Clearly, the vertex vn is the required optimal solution.
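A minimal sketch of this randomized incremental algorithm, specialized to the plane (minimizing c · x subject to halfplanes a · x ≤ b). It assumes the first two halfplanes form the given acceptable basis (e.g., two bounding-box constraints whose vertex is optimal for them alone), ignores degeneracies, and uses illustrative helper names.

import random

def _vertex(h1, h2):
    # Intersection point of the boundary lines of two halfplanes a . x <= b.
    (a1, b1), (a2, b2) = h1, h2
    det = a1[0] * a2[1] - a1[1] * a2[0]
    x = (b1 * a2[1] - b2 * a1[1]) / det
    y = (a1[0] * b2 - a2[0] * b1) / det
    return (x, y)

def _solve_on_line(h, constraints, c):
    # Minimize c . x on the boundary line a . x = b of h, subject to `constraints`.
    a, b = h
    d = (-a[1], a[0])                                          # direction of the line
    p = (b / a[0], 0.0) if abs(a[0]) > abs(a[1]) else (0.0, b / a[1])  # point on it
    lo, hi = float("-inf"), float("inf")
    for ai, bi in constraints:                                 # ai . (p + t d) <= bi
        coef = ai[0] * d[0] + ai[1] * d[1]
        rhs = bi - (ai[0] * p[0] + ai[1] * p[1])
        if abs(coef) < 1e-12:
            if rhs < -1e-9:
                return None                                    # line misses the constraint
        elif coef > 0:
            hi = min(hi, rhs / coef)
        else:
            lo = max(lo, rhs / coef)
    if lo > hi:
        return None                                            # restricted LP infeasible
    t = lo if c[0] * d[0] + c[1] * d[1] > 0 else hi
    return (p[0] + t * d[0], p[1] + t * d[1])

def lp_2d(c, halfplanes):
    # halfplanes[0], halfplanes[1] are the initial acceptable basis; shuffle the rest.
    hs = halfplanes[:2] + random.sample(halfplanes[2:], len(halfplanes) - 2)
    v = _vertex(hs[0], hs[1])
    for i in range(2, len(hs)):
        a, b = hs[i]
        if a[0] * v[0] + a[1] * v[1] <= b + 1e-9:
            continue                                           # previous optimum still feasible
        v = _solve_on_line(hs[i], hs[:i], c)                   # new optimum lies on the new line
        if v is None:
            return None                                        # LP is infeasible
    return v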
®Indeed, $\frac{d}{(i-d)+d}$ lies between $\frac{d}{i-d}$ and $\frac{d}{d} = 1$.
Proof: The expected running time is
$$S(m, d) = O(md) + S(m, d - 1) + T(m, d),$$
where T(m, d) is the time to solve a LP in the restricted case of Section 9.2.1. The solution to this recurrence is $O\bigl((3d)^d m\bigr)$; see
Lemma 9.2.1.
since $1 - x \ge e^{-2x}$, for 0 < x < 1/2. If this happens, then the optimal solution for $\widehat{L}$ is v. This can be verified by computing how
many constraints of L the optimal solution of $\widehat{L}$ violates. If it violates more than k constraints, we ignore it. Otherwise, we return
this as our candidate solution.
Next, we amplify the probability of success by repeating this process $M = 8k^d \ln(1/\delta)$ times, returning the best solution found.
The probability that in all these (independent) iterations we failed to generate the optimal (violated) solution is at most
$$(1 - \alpha)^M \le \left(1 - \frac{1}{8k^d}\right)^M \le \exp\left(-\frac{M}{8k^d}\right) = \exp\left(-\ln\frac{1}{\delta}\right) = \delta.$$
Theorem 9.3.1 Let L be a linear program with n constraints over d variables, let k > 0 be a parameter, and δ > 0 a confidence
parameter. Then one can compute the optimal solution to L violating at most k constraints of L, in $O\bigl(n k^d \log(1/\delta)\bigr)$ time. The
solution returned is correct with probability ≥ 1 − δ.
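A minimal sketch of the sampling-and-verification strategy behind Theorem 9.3.1, with illustrative names: solve_lp is an exact low-dimensional LP solver (e.g., the one from Section 9.2), violates and value are assumed predicates for constraint violation and objective value, and the sampling probability 1/k is the choice consistent with the (1 − 1/k)^k ≥ e⁻² bound used above.

import math
import random

def lp_with_violations(constraints, k, d, delta, solve_lp, violates, value):
    M = math.ceil(8 * k ** d * math.log(1.0 / delta))   # number of repetitions
    best = None
    for _ in range(M):
        sample = [h for h in constraints if random.random() < 1.0 / k]
        x = solve_lp(sample)                             # optimum of the sampled LP
        if x is None:
            continue
        if sum(1 for h in constraints if violates(x, h)) > k:
            continue                                     # too many violations: ignore
        if best is None or value(x) < value(best):
            best = x                                     # keep the best candidate
    return best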
Lemma 9.4.1 Let L be a linear program with n constraints over d variables, let k > 0 and ε > 0 be parameters. Then one can
compute a solution to L violating at most (1 + ε)k constraints of L such that its value is better than the optimal solution violating
k constraints of L. The expected running time of the algorithm is
$$O\left(n + n\,\min\left(\frac{\log^{d+1} n}{\varepsilon^{2d}\, k},\ \frac{\log^{d+2} n}{\varepsilon^{2d+2}}\right)\right).$$
The algorithm succeeds with high probability.
Proof: Let $\rho = O\left(\frac{d}{k\varepsilon^2}\ln n\right)$, and pick each constraint of L into $\widehat{L}$ with probability ρ. Next, the algorithm computes the optimal
solution v in $\widehat{L}$ violating at most
$$k' = (1 + \varepsilon/3)\rho k$$
constraints, and returns this as the required solution.
I am sure the reader guessed correctly the consequences of such a despicable scenario: The universe collapses and is replaced
by a cucumber.
We need to prove the correctness of this algorithm. To this end, the reliable sampling lemma (Lemma 8.3.1) states that any
vertex v of depth u in L, for u ≥ k, has depth in the range
$$\bigl[(1 - \varepsilon/3)u\rho,\ (1 + \varepsilon/3)u\rho\bigr]$$
in $\widehat{L}$, and this holds with high probability (here we are using the fact that there are at most $n^d$ vertices defined by L).
In particular, let v_opt be the optimal solution for L of depth k. With high probability, v_opt has depth ≤ (1 + ε/3)ρk = k′ in $\widehat{L}$,
which implies that the returned solution v is better than (or as good as) v_opt, since v has depth at most k′ in $\widehat{L}$.
Next, we need to prove that v is not too deep. So, assume that v is of depth β in L. By the reliable sampling lemma, we have
that the depth of v in $\widehat{L}$ is in the range $\bigl[(1 - \varepsilon/3)\beta\rho,\ (1 + \varepsilon/3)\beta\rho\bigr]$. In particular, we know that $(1 - \varepsilon/3)\beta\rho \le k' = (1 + \varepsilon/3)\rho k$. That
is,
$$\beta \le \frac{1 + \varepsilon/3}{1 - \varepsilon/3}\, k \le (1 + \varepsilon/3)(1 + \varepsilon/2)\, k \le (1 + \varepsilon)k,$$
since 1/(1 − ε/3) ≤ 1 + ε/2 for ε ≤ 1.®
As for the running time, we are using the algorithm of Theorem 9.3.1. The input size is O(nρ) and the depth threshold is k′.
(The bound on the input size holds with high probability. We omit the easy but tedious proof of that using the Chernoff inequality.)
As such, the running time is
$$O\bigl(n + n\rho\,(k')^d \log n\bigr) = O\bigl(n + \min(n,\, n\rho)\,(\rho k)^d \log n\bigr) = O\left(n + n\,\min\left(\frac{\log^{d+1} n}{\varepsilon^{2d}\, k},\ \frac{\log^{d+2} n}{\varepsilon^{2d+2}}\right)\right).$$
Note, that the running time of Lemma 9.4.1 is linear if k is sufficiently large and ε is fixed.
If these two axioms hold, we refer to (H, w) as a LP-type problem. It is easy to verify that linear programming is a LP-type
problem.
Definition 9.5.1 A basis is a subset B ⊆ H such that w(B) > −∞ and w(B′) < w(B) for any proper subset B′ of B.
As in linear programming, we have to assume that certain basic operations can be performed quickly. These operations are:
(A) (Violation test.) For a constraint h and a basis B, test whether h is violated by B or not. Namely, test if w(B ∪ {h}) > w(B).
(B) (Basis computation.) For a constraint h and a basis B, compute the basis of B ∪ {h}.
We also need to assume that we are given an initial basis B0 from which to start our computation. The combinatorial dimension
of (H, w) is the maximum size of a basis of H. It is easy to verify that the algorithm we presented for linear programming (the special
case of Section 9.2.1) works verbatim in this setting. Indeed, start with B0, and randomly permute the remaining constraints. Now,
add the constraints in this random order, and at each step check if the new constraint violates the current solution, and if so, update the
basis of the new solution. The recursive call here corresponds to solving a subproblem where some members of the basis are fixed.
We conclude:
Theorem 9.5.2 Let (H, w) be a LP-type problem with n constraints and combinatorial dimension d. Assuming that the basic
operations take constant time, (H, w) can be solved using $d^{O(d)} n$ basic operations.
®Indeed, $(1 - \varepsilon/3)(1 + \varepsilon/2) = 1 - \varepsilon/3 + \varepsilon/2 - \varepsilon^2/6 = 1 + \varepsilon(1 - \varepsilon)/6 \ge 1$, for ε ≤ 1.
9.5.1 Examples for LP-type problems
Smallest enclosing ball. Let P be a set of n points in IR^d, and let r(P) denote the radius of the smallest ball enclosing P.
Under general position assumptions, there are at most d + 1 points on the boundary of this smallest enclosing ball. We claim that
the problem is an LP-type problem. Indeed, the basis in this case is the set of points determining the smallest enclosing ball. The
combinatorial dimension is thus d + 1. The monotonicity property holds trivially. As for the locality property, assume that we have
a set Q ⊆ P such that r(Q) = r(P). As such, P and Q have the same enclosing ball. Now, if we add a point p to Q and the radius of
its minimum enclosing ball increases, then the ball enclosing P must also change (and get bigger) when we insert p to P. Thus, this
is a LP-type problem, and it can be solved in linear time.
Theorem 9.5.3 Given a set P of n points in IRd , one can compute its smallest enclosing ball in (expected) linear time.
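A minimal sketch of the randomized incremental scheme, specialized to the plane (a Welzl-style computation of the smallest enclosing disk); general position and at least two input points are assumed, and the helper names are illustrative.

import random

def _disk2(p, q):
    # Smallest disk with p and q on its boundary (diametral disk).
    c = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
    r = ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 / 2.0
    return (c[0], c[1], r)

def _disk3(p, q, s):
    # Circumscribed disk of three (non-collinear) points.
    (ax, ay), (bx, by), (cx, cy) = p, q, s
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay) + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx) + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy, ((ux - ax) ** 2 + (uy - ay) ** 2) ** 0.5)

def _inside(disk, p, eps=1e-9):
    return (p[0] - disk[0]) ** 2 + (p[1] - disk[1]) ** 2 <= (disk[2] + eps) ** 2

def min_enclosing_disk(points):
    pts = list(points)
    random.shuffle(pts)                       # random insertion order
    D = _disk2(pts[0], pts[1])
    for i in range(2, len(pts)):              # invariant: D encloses pts[0..i-1]
        if _inside(D, pts[i]):
            continue
        D = _disk2(pts[0], pts[i])            # pts[i] must lie on the new boundary
        for j in range(1, i):
            if _inside(D, pts[j]):
                continue
            D = _disk2(pts[j], pts[i])        # pts[j] is also on the boundary
            for l in range(j):
                if not _inside(D, pts[l]):
                    D = _disk3(pts[l], pts[j], pts[i])
    return D                                   # (center_x, center_y, radius)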
Finding the time of first intersection. Let C(t) be a parameterized convex shape in IR^d, such that C(0) is empty, and C(t) ⊂ C(t′) if t < t′.
We are given n such shapes C1, . . . , C_n, and we would like to determine the minimal t for which they all have a common
intersection. Assume that, given a point p and such a shape C, we can decide (in constant time) the minimum t for which p ∈ C(t).
Similarly, given (say) d + 1 of these shapes, we can decide in constant time the minimum t for which they intersect, and this common
point of intersection. We would like to find the minimum t for which they all intersect. Let us also assume that these shapes are well
behaved in the sense that, for any t, we have $\lim_{\Delta \to 0} \mathrm{Vol}\bigl(C(t + \Delta) \setminus C(t)\bigr) = 0$ (namely, such a shape can not “jump” – it grows
continuously). It is easy to verify that this is a LP-type problem, and as such it can be solved in linear time.
Note that this problem is an extension of the previous problem. Indeed, if we grow a ball of radius t around each point of P,
then the problem of determining the minimal t at which all these growing balls have a non-empty intersection is equivalent to finding the
minimum radius ball enclosing all the points.
Linear programming in low dimensions. The first to realize that linear programming can be solved in linear time
in low dimensions was Megiddo [Meg83, Meg84]. His algorithm was deterministic but considerably more complicated than
the randomized algorithm we present. Clarkson [Cla95] showed how to use randomization to get a simple algorithm for linear
programming with running time O(d²n + noise), where the noise is a constant exponential in d. Our presentation follows the paper
by Seidel [Sei91]. Surprisingly, one can achieve a running time with the noise being subexponential in d. This follows by plugging
the subexponential algorithms of Kalai [Kal92] or Matoušek et al. [MSW96] into Clarkson’s algorithm [Cla95]. The resulting
algorithm has expected running time $O\bigl(d^2 n + \exp\bigl(c\sqrt{d \log d}\,\bigr)\bigr)$, for some constant c. See the survey by Goldwasser [Gol95] for
more details.
More information on Clarkson’s algorithm. Clarkson’s algorithm contains some interesting new ideas that are worth mentioning
shortly. (Matoušek et al. [MSW96] algorithm is somewhat similar to the algorithm we presented.)
Observe that if the solution for a random sample R is violated by a set X of constraints, then X must contain (at least)
one constraint which is in the basis of the optimal solution. Thus, by picking R to be of size (roughly) √n, we know that it is
a (1/√n)-net, and there would be at most √n constraints violating the solution of R. Thus, repeating this d times, at each stage
solving the problem on the collected constraints from the previous iterations, together with the current random sample, results in a set
of O(d√n) constraints that contains the optimal basis. Now solve recursively the linear program on this (greatly reduced) set of
constraints. Namely, we spent O(d²n) time (d times checking if the n constraints violate a given solution), called recursively d
times on “small” subproblems of size (roughly) O(√n), resulting in a fast algorithm.
An alternative algorithm uses the same observation, together with the reweighting technique. Here each constraint is sampled
according to its weight (which is initially 1). By doubling the weight of the violated constraints, one can argue that after a small
number of iterations, the sample would contain the required basis, while being small. See Chapter 17 for more details.
Clarkson’s algorithm works by combining these two algorithms together.
Linear programming with violations. The algorithm of Section 9.3 seems to be new, although it is implicit in the work
of Matoušek [Mat95b], which presents a slightly faster deterministic algorithm. The first paper on this problem (in two dimensions)
is due to Everett et al. [ERvK96]. This was extended by Matoušek to higher dimensions [Mat95b]. His algorithm relies on the
idea of computing all O(k^d) local maxima in the “k-level” explicitly, by traveling between them. This is done by solving linear
programming instances which are “similar”. As such, these results can be further improved using techniques for dynamic linear
programming that allow insertions and deletions of constraints; see the work by Chan [Cha96]. Chan [Cha05] showed how to
further improve these algorithms in dimensions 2, 3 and 4, although these improvements disappear if k is close to linear.
The idea of approximate linear programming with violations is due to Aronov and Har-Peled [AH05], and our presentation
follows their results. Using more advanced data-structures these results can be further improved (as far as the polylog noise is
concerned), see the work by Afshani and Chan [AC07].
LP-type problems. The notion of LP-type algorithms is mentioned in the work of Sharir and Welzl [SW92]. They also showed
that deciding if a set of (axis parallel) rectangles can be pierced by 3 points is a LP-type problem (quite surprising, as the problem
has no convex programming flavor). Our example of computing the first intersection of growing convex sets is motivated by the work
of Amenta [Ame94] on the connection between LP-type problems and Helly-type theorems.
Intuitively, any low-dimensional convex programming problem is a natural candidate to be solved using LP-type techniques.
9.7 Exercises
Chapter 10
I don’t know why it should be, I am sure; but the sight of another man asleep in bed when I am up, maddens me. It
seems to me so shocking to see the precious hours of a man’s life - the priceless moments that will never come back
to him again - being wasted in mere brutish sleep.
– – Three men in a boat, Jerome K. Jerome
In this chapter, we formally investigate what the feasible region of a linear program looks like, and establish the correctness
of the algorithm we presented for linear programming. Linear programming seems to be one case where the geometric intuition is
quite clear, but crystallizing it into a formal proof requires quite a bit of work. In particular, we prove in this chapter more than we
strictly need, since it supports (and, may we dare suggest, with due humility and respect to the most esteemed reader, that it even
expands¯) our natural intuition.
Underlying our discussion is the dichotomy between the input to LP, which is a set of halfspaces, and the entities LP works
with, which are vertices. In particular, we need to establish that describing the feasible region of a LP in terms of the (convex hull
of its) vertices or, alternatively, as the intersection of halfspaces, is equivalent.
10.1 Preliminaries
We had already encountered Radon’s theorem, which we restate.
Theorem 10.1.1 (Radon’s Theorem.) Let P = {p1, . . . , p_{d+2}} be a set of d + 2 points in IR^d. Then, there exist two disjoint subsets
Q and R of P, such that CH(Q) ∩ CH(R) ≠ ∅.
Theorem 10.1.2 (Helly’s theorem.) Let F be a set of n convex sets in IR^d. The intersection of all the sets of F is non-empty if and
only if every d + 1 of them have a non-empty intersection.
¯Hopefully the reader is happy we are less polite to him/her in the rest of the book, since otherwise the text would be insufferably tedious.
since U(Q) ∪ U(R) = F. As such, $R \subseteq \bigcap_{Y \in U(Q)} Y$ and $Q \subseteq \bigcap_{Y \in U(R)} Y$. Now, by the convexity of the sets of F, we have
$\mathrm{CH}(R) \subseteq \bigcap_{Y \in U(Q)} Y$ and $\mathrm{CH}(Q) \subseteq \bigcap_{Y \in U(R)} Y$. Namely, we have
$$r \in \mathrm{CH}(R) \cap \mathrm{CH}(Q) \subseteq \left(\bigcap_{Y \in U(Q)} Y\right) \cap \left(\bigcap_{Y \in U(R)} Y\right) = \bigcap_{Y \in F} Y.$$
Theorem 10.1.3 (Carathéodory theorem.) Let X be a convex set in IRd , and let p be some point in the interior of X. Then p is a
convex combination of d + 1 points of X.
Proof: Suppose $p = \sum_{i=1}^{m} \lambda_i x_i$ is a convex combination of m > d + 1 points, where {x1, . . . , x_m} ⊆ X, λ1, . . . , λ_m > 0 and
$\sum_i \lambda_i = 1$. We will show that p can be rewritten as a convex combination of m − 1 of these points, as long as m > d + 1.
So, consider the following system of equations:
$$\sum_{i=1}^{m} \gamma_i x_i = 0 \qquad\text{and}\qquad \sum_{i=1}^{m} \gamma_i = 0. \tag{10.1}$$
It has m > d + 1 variables (i.e., γ1, . . . , γ_m) but only d + 1 equations. As such, there is a non-trivial solution to this system of
equations; denote it by $\widehat{\gamma}_1, \ldots, \widehat{\gamma}_m$. Since $\widehat{\gamma}_1 + \cdots + \widehat{\gamma}_m = 0$, some of the $\widehat{\gamma}_i$’s are strictly positive, and some of them are strictly
negative. Let
$$\tau = \min_{j,\ \widehat{\gamma}_j > 0} \frac{\lambda_j}{\widehat{\gamma}_j} > 0.$$
Assume, without loss of generality, that $\tau = \lambda_1/\widehat{\gamma}_1$. Let
$$\widetilde{\lambda}_i = \lambda_i - \tau\widehat{\gamma}_i,$$
for i = 1, . . . , m. Then $\widetilde{\lambda}_1 = \lambda_1 - (\lambda_1/\widehat{\gamma}_1)\widehat{\gamma}_1 = 0$. Furthermore, if $\widehat{\gamma}_i < 0$, then $\widetilde{\lambda}_i = \lambda_i - \tau\widehat{\gamma}_i \ge \lambda_i > 0$. Otherwise, if $\widehat{\gamma}_i > 0$, then
$$\widetilde{\lambda}_i = \lambda_i - \left(\min_{j,\ \widehat{\gamma}_j > 0} \frac{\lambda_j}{\widehat{\gamma}_j}\right)\widehat{\gamma}_i \ge \lambda_i - \frac{\lambda_i}{\widehat{\gamma}_i}\,\widehat{\gamma}_i \ge 0.$$
So, $\widetilde{\lambda}_1 = 0$ and $\widetilde{\lambda}_2, \ldots, \widetilde{\lambda}_m \ge 0$. Furthermore,
$$\sum_{i=1}^{m} \widetilde{\lambda}_i = \sum_{i=1}^{m} \bigl(\lambda_i - \tau\widehat{\gamma}_i\bigr) = \sum_{i=1}^{m} \lambda_i - \tau \sum_{i=1}^{m} \widehat{\gamma}_i = \sum_{i=1}^{m} \lambda_i = 1,$$
since $\widehat{\gamma}_1, \ldots, \widehat{\gamma}_m$ is a solution to Eq. (10.1). As such, $q = \sum_{i=2}^{m} \widetilde{\lambda}_i x_i$ is a convex combination of m − 1 points of X. Furthermore, as
$\widetilde{\lambda}_1 = 0$, we have
$$q = \sum_{i=2}^{m} \widetilde{\lambda}_i x_i = \sum_{i=1}^{m} \widetilde{\lambda}_i x_i = \sum_{i=1}^{m} \bigl(\lambda_i - \tau\widehat{\gamma}_i\bigr) x_i = \sum_{i=1}^{m} \lambda_i x_i - \tau \sum_{i=1}^{m} \widehat{\gamma}_i x_i = \sum_{i=1}^{m} \lambda_i x_i - \tau \cdot 0 = \sum_{i=1}^{m} \lambda_i x_i = p,$$
since (again) $\widehat{\gamma}_1, \ldots, \widehat{\gamma}_m$ is a solution to Eq. (10.1). As such, we found a representation of p as a convex combination of m − 1 points,
and we can continue this process till m = d + 1, establishing the theorem.
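A minimal sketch of one such reduction step (illustrative code only; the helper name and the use of an SVD to find a non-trivial solution of Eq. (10.1) are implementation choices, not part of the text).

import numpy as np

def caratheodory_step(X, lam, eps=1e-12):
    # X: (m, d) array of points, lam: (m,) positive weights summing to 1, with m > d + 1.
    # Returns new nonnegative weights describing the same point, with one weight set to zero.
    m, d = X.shape
    assert m > d + 1
    A = np.vstack([X.T, np.ones(m)])              # Eq. (10.1): a (d+1) x m system, so a
    gamma = np.linalg.svd(A)[2][-1]               # non-trivial null vector exists
    if gamma.max() <= eps:                        # make sure some entries are positive
        gamma = -gamma
    pos = np.where(gamma > eps)[0]
    j = pos[np.argmin(lam[pos] / gamma[pos])]     # index attaining tau
    tau = lam[j] / gamma[j]
    new_lam = lam - tau * gamma                   # stays nonnegative and sums to 1
    new_lam[j] = 0.0                              # exactly zero by the choice of tau
    return new_lam

# Usage: repeat while more than d + 1 weights are nonzero; the represented point
# lam @ X is unchanged by each step.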
Fourier-Motzkin elimination. Let L = (M, b) be an instance of LP (we care only about feasibility here, so we ignore the
target function). Consider the ith variable x_i in the LP L. If it appears only with positive coefficients in all the inequalities (i.e., the
ith column of M has only positive numbers), then we can set the ith variable to be a sufficiently large negative number, and all the
inequalities where it appears would be immediately feasible. The same holds if all such coefficients are negative. Thus, consider the case
where the variable x_i appears with both negative coefficients and positive coefficients in the LP. Let us inspect two such inequalities,
say the kth and lth inequalities, and assume, for the sake of concreteness, that M_{ki} > 0 and M_{li} < 0. Clearly, we can multiply the lth
inequality by a positive number and add it to the kth inequality, so that in the resulting inequality the coefficient of x_i is zero.
Let L′ = elim(L, i) denote the resulting linear program, where we copy all inequalities of the original LP where x_i has zero as its
coefficient. In addition, we add all the inequalities that can be formed by taking “positive” and “negative” appearances of x_i in the original
LP and canceling them out as described above. Note that L′ might have m²/4 inequalities, but since x_i is now eliminated (i.e.,
all of its appearances are with coefficient zero), the LP L′ is defined over d − 1 variables.
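A minimal sketch of this elimination step for the mixed-sign case (the function name is illustrative; no attempt is made to remove redundant inequalities).

def eliminate(M, b, i):
    # M: list of coefficient rows, b: list of right-hand sides of M x <= b. Returns (M', b').
    zero, pos, neg = [], [], []
    for row, rhs in zip(M, b):
        (zero if row[i] == 0 else pos if row[i] > 0 else neg).append((row, rhs))
    new_M = [list(row) for row, _ in zero]
    new_b = [rhs for _, rhs in zero]
    # Combine every positive appearance of x_i with every negative one so that x_i cancels.
    for prow, prhs in pos:
        for nrow, nrhs in neg:
            lam = prow[i] / -nrow[i]              # positive multiplier for the "negative" row
            row = [p + lam * n for p, n in zip(prow, nrow)]
            row[i] = 0.0                           # coefficient of x_i is now zero
            new_M.append(row)
            new_b.append(prhs + lam * nrhs)
    return new_M, new_b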
Lemma 10.1.4 Let L be an instance of LP with d variables and m inequalities. The linear program L′ = elim(L, i) is feasible if and
only if L is feasible, for any i ∈ {1, . . . , d}.
Proof: One direction is easy: if L is feasible, then its solution (omitting the ith variable) is feasible for L′.
The other direction becomes clear once we understand what the elimination really does. So, consider two inequalities in L, such
that M_{ki} < 0 and M_{ji} > 0. We can rewrite these inequalities such that they become
$$a_0 + \sum_{\tau \ne i} a_\tau x_\tau \le x_i \tag{A}$$
$$\text{and}\qquad x_i \le b_0 + \sum_{\tau \ne i} b_\tau x_\tau, \tag{B}$$
respectively. The eliminated inequality described above is no more than the inequality we get by chaining these inequalities
together; that is,
$$a_0 + \sum_{\tau \ne i} a_\tau x_\tau \le x_i \le b_0 + \sum_{\tau \ne i} b_\tau x_\tau \quad\Longrightarrow\quad a_0 + \sum_{\tau \ne i} a_\tau x_\tau \le b_0 + \sum_{\tau \ne i} b_\tau x_\tau.$$
In particular, for a feasible solution to L′, all the left sides of inequalities of type (A) must be smaller than (or equal to) the right sides of
all the inequalities of type (B), since we combined all such pairs of inequalities into an inequality in L′.
In particular, given a feasible solution sol to L′, one can extend it into a solution of L by computing a value of x_i such that all the
original inequalities hold. Indeed, every pair of inequalities as above of L, when we substitute the values of x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_d
of sol into L, results in an interval I, such that x_i must lie inside I. Each inequality of type (A) induces a left endpoint of such
an interval, and each inequality of type (B) induces a right endpoint of such an interval. We created all possible intervals of this
type (using these endpoints) when creating L′, and as such, for sol, all these intervals must be non-empty. We conclude that the
intersection of all these intervals is non-empty, implying that one can pick a value for x_i such that L is feasible.
Given a H-polyhedron P, the elimination of the ith variable, elim(P, i), can be interpreted as projecting the polyhedron P onto
the hyperplane x_i = 0. Furthermore, the projected polyhedron is still a H-polyhedron. By a change of variables, this implies that any
projection of a H-polyhedron onto a hyperplane is a H-polyhedron. We can repeat this process of projecting down the polyhedron
several times, which implies the following lemma.
Lemma 10.1.5 The projection of a H-polyhedron into any affine subspace is a H-polyhedron.
Lemma 10.1.6 (Farkas Lemma) Let M ∈ IR^{m×d} and b ∈ IR^m specify a LP. Then either:
(i) There exists a feasible solution x ∈ IR^d to the LP; that is, Mx ≤ b.
(ii) Or, there is no feasible solution, and we can prove it by combining the inequalities of the LP into an infeasible inequality.
Formally, there exists a vector w ∈ IR^m, such that w ≥ 0, wM = 0 and w · b < 0. Namely, the inequality (that must hold if the
LP is feasible)
$$(wM)\, x \le w \cdot b$$
is infeasible.
Proof: Clearly, the two options can not hold together, so all we need to show is that if an LP is infeasible then the second
option holds.
Observe that if we apply a sequence of eliminations to an LP, all the resulting inequalities in the new LP are positive combinations
of the original inequalities. So, let us apply this process of elimination to each one of the variables in turn. If the original LP
is not feasible, then sooner or later this elimination process would get stuck, as it will generate an infeasible inequality. Such an
inequality would have all the variables with zero coefficients, and the constant would be negative. This establishes the claim.
Note that the above elimination process provides us with a naive algorithm for solving LP. This algorithm is extremely
inefficient, as in the last elimination stage we might end up with $m^{2^d}$ inequalities. This would lead to an algorithm which is
unacceptably slow for solving LP.
We remind the reader that a linear equality of the form $\sum_i a_i x_i = c$ can be rewritten as the two inequalities $\sum_i a_i x_i \le c$ and
$\sum_i a_i x_i \ge c$. The great success of the Farkas Lemma in the marketplace has led to several sequels to it, and here is one of them.
Lemma 10.1.7 (Farkas Lemma II) Let M ∈ IR^{m×d} and b ∈ IR^m, and consider the following linear program: Mx = b, x ≥ 0. Then,
either (i) this linear program is feasible, or (ii) there is a vector w ∈ IR^m such that wM ≥ 0 and w · b < 0.
Proof: If (ii) holds, then consider a feasible solution x to the LP. We have that M x = b, multiplying this equality from the left
by w, we have that
(wM)x = w · b.
But (ii) claims that the quantity on the left is non-negative (since x ≥ 0), and the quantity on the right is negative. A contradiction.
As such these two options are mutually exclusive.
The linear program Mx = b, x ≥ 0 can be rewritten as Mx ≤ b, Mx ≥ b, x ≥ 0, which in turn is equivalent to the LP:
$$M x \le b, \quad -M x \le -b, \quad -x \le 0 \qquad\Longleftrightarrow\qquad \begin{pmatrix} M \\ -M \\ -I_d \end{pmatrix} x \le \begin{pmatrix} b \\ -b \\ 0 \end{pmatrix},$$
where I_d is the d × d identity matrix. Now, if the original LP does not have a feasible solution, then by the original Farkas lemma
(Lemma 10.1.6), there must be a vector (w1, w2, w3) ≥ 0 such that
$$(w_1, w_2, w_3) \begin{pmatrix} M \\ -M \\ -I_d \end{pmatrix} = 0 \qquad\text{and}\qquad (w_1, w_2, w_3) \cdot \begin{pmatrix} b \\ -b \\ 0 \end{pmatrix} < 0.$$
Namely, (w1 − w2)M − w3 = 0 and (w1 − w2) · b < 0. But w3 ≥ 0, which implies that
$$(w_1 - w_2)\, M = w_3 \ge 0.$$
Namely, the claim holds for w = w1 − w2.
Cones and vertices. For a set of vectors V ⊆ IR^d, let cone(V) denote the cone they generate. Formally,
$$\mathrm{cone}(V) = \left\{ \sum_i t_i \vec{v}_i \ \middle|\ \vec{v}_i \in V,\ t_i \ge 0 \right\}.$$
A halfspace passes through a point p, if p is contained in the hyperplane bounding this halfspace. Since 0 is the apex of cone(V),
it is natural to presume that cone(V) can be generated by a finite intersection of halfspaces, all of them passing through the origin
(which is indeed true).
In the following, let e(i, d) denote the ith orthonormal vector in IR^d; namely,
$$e(i, d) = \bigl(\,\underbrace{0, \ldots, 0}_{i-1 \text{ coords}},\ 1,\ \underbrace{0, \ldots, 0}_{d-i \text{ coords}}\,\bigr).$$
Lemma 10.1.8 Let M ∈ IRm×d be a given matrix, and consider the H-polyhedron P formed by all points (x, w) ∈ IRd+m , such that
M x ≤ w. Then P = cone(V), where V is a finite set of vectors in IRd+m .
Proof: Let E_i = e(i, d + m), for i = d + 1, . . . , d + m. Clearly, the inequality Mx ≤ w trivially holds if (x, w) = E_i, for
i = d + 1, . . . , d + m (since x = 0 in such a case). Also, let
$$\vec{v}_i = \bigl(e(i, d),\ M\, e(i, d)\bigr),$$
for i = 1, . . . , d. Clearly, the inequality holds for $\vec{v}_i$, since trivially $M\, e(i, d) \le M\, e(i, d)$, for i = 1, . . . , d. Let $V = \bigl\{\vec{v}_1, \ldots, \vec{v}_d, E_{d+1}, \ldots, E_{d+m}\bigr\}$.
We claim that cone(V) = P. One direction is easy: as the inequality holds for all the vectors in V, it also holds for any
positive combination of these vectors, implying that cone(V) ⊆ P. The other direction is slightly more challenging. Consider an
(x, w) ∈ IR^{d+m} such that Mx ≤ w. Clearly, w − Mx ≥ 0. As such, (x, w) = (x, Mx) + (0, w − Mx). Now, $(x, Mx) \in \mathrm{cone}\bigl(\bigl\{\vec{v}_1, \ldots, \vec{v}_d\bigr\}\bigr)$,
and since w − Mx ≥ 0, we have $(0, w - Mx) \in \mathrm{cone}(E_{d+1}, \ldots, E_{d+m})$. Thus, $P \subseteq \mathrm{cone}\bigl(\bigl\{\vec{v}_1, \ldots, \vec{v}_d\bigr\}\bigr) + \mathrm{cone}(E_{d+1}, \ldots, E_{d+m}) = \mathrm{cone}(V)$.
We need the following simple pairing lemma, which we leave as an exercise for the reader (see Exercise 10.5.1).
As in most sequels, Farkas Lemma II is equivalent to the original Farkas Lemma. I only hope the reader does not feel
cheated.
Lemma 10.1.9 Let α1, . . . , α_n, β1, . . . , β_m be positive numbers, such that $\sum_{i=1}^{n} \alpha_i = \sum_{j=1}^{m} \beta_j$, and consider the positive combination
$\vec{w} = \sum_{i=1}^{n} \alpha_i \vec{v}_i + \sum_{j=1}^{m} \beta_j \vec{u}_j$, where $\vec{v}_1, \ldots, \vec{v}_n, \vec{u}_1, \ldots, \vec{u}_m$ are vectors (say, in IR^d). Then, there are non-negative $\delta_{i,j}$’s, such that
$\vec{w} = \sum_{i,j} \delta_{i,j} \bigl(\vec{v}_i + \vec{u}_j\bigr)$.
Lemma 10.1.10 Let C = cone(V) be a cone generated by a set of vectors V in IR^d. Consider the region P = C ∩ h, where h is a
hyperplane that passes through the origin. Then, P is a cone; namely, there exists a set of vectors V′ such that P = cone(V′).
Proof: By a rigid rotation of the axis system, we can assume that h ≡ (x1 = 0). Furthermore, by scaling, we can assume that the
first coordinate of all points in V is either −1, 0 or 1 (clearly, scaling the vectors generating the cone does not affect the cone itself).
Let V = V_{−1} ∪ V_0 ∪ V_1, where V_0 is the set of vectors in V with the first coordinate being zero, and V_{−1} (resp. V_1) are the vectors
in V with the first coordinate being −1 (resp. 1).
Let $V_{-1} = \bigl\{\vec{v}_1, \ldots, \vec{v}_n\bigr\}$, $V_1 = \bigl\{\vec{u}_1, \ldots, \vec{u}_m\bigr\}$, and let
$$V' = V_0 \cup \Bigl\{ \vec{v}_i + \vec{u}_j \;\Bigm|\; i = 1, \ldots, n,\ j = 1, \ldots, m \Bigr\}.$$
Clearly, all the vectors of V′ have zero in their first coordinate, and as such V′ ⊆ h, implying that cone(V′) ⊆ C ∩ h.
As for the other direction, consider a vector $\vec{w} \in C \cap h$. It can be rewritten as $\vec{w} = \vec{w}_0 + \sum_i \alpha_i \vec{v}_i + \sum_j \beta_j \vec{u}_j$, where $\vec{w}_0 \in \mathrm{cone}(V_0)$.
Since the first coordinate of $\vec{w}$ is zero, we must have that $\sum_i \alpha_i = \sum_j \beta_j$. Now, by the above pairing lemma (Lemma 10.1.9), there
are (non-negative) $\delta_{i,j}$’s such that
$$\vec{w} = \vec{w}_0 + \sum_{i,j} \delta_{i,j} \bigl(\vec{v}_i + \vec{u}_j\bigr) \in \mathrm{cone}(V'),$$
implying that C ∩ h ⊆ cone(V′).
We next handle the general question of what the intersection of a general cone with a hyperplane looks like.
Lemma 10.1.11 Let C = cone(V) be a cone generated by a set of vectors V in IR^d. Consider the region P = C ∩ h, where h is any
hyperplane in IR^d. Then, there exist sets of vectors U and W such that P = CH(U) + cone(W).
Proof: As before, by a change of variables, we can assume that h ≡ (x1 = 1). As before, we can normalize the vectors of V so that
the first coordinate is either 0, −1 or 1. Next, as before, break V into three disjoint sets, such that V = V_{−1} ∪ V_0 ∪ V_1, where V_i are
the vectors in V with the first coordinate being i, for i ∈ {−1, 0, 1}.
By Lemma 10.1.10, there exists a set of vectors W_0 such that cone(W_0) = C ∩ (x1 = 0). Also, let X = cone(V_1) ∩ (x1 = 1). A
point p ∈ X is a positive combination $p = \sum_i t_i \vec{v}_i$, where $\vec{v}_i \in V_1$ and $t_i \ge 0$ for all i. But the first coordinate of all the points
of V_1 is 1, and so is the first coordinate of p. Namely, it must be that $\sum_i t_i = 1$, implying that X = CH(V_1).
We claim that P = cone(V) ∩ (x1 = 1) is equal to Y = CH(V_1) + cone(W_0). One direction is easy: as V_1, W_0 ⊆ cone(V), it
follows that Y ⊆ cone(V). Now, a point of Y is the sum of two vectors, one of which has 1 in the first coordinate, and the other has
zero there. As such, a point of Y lies on the hyperplane x1 = 1, implying Y ⊆ P.
As for the other direction, consider a point p ∈ P. It can be written as
$$p = \sum_{i,\ \vec{v}_i \in V_1} \alpha_i \vec{v}_i + \sum_{j,\ \vec{u}_j \in V_0} \beta_j \vec{u}_j + \sum_{k,\ \vec{w}_k \in V_{-1}} \gamma_k \vec{w}_k,$$
where $\alpha_i, \beta_j, \gamma_k \ge 0$, for all i, j, k. Now, by considering only the first coordinate of this sum of vectors, we have
$$\sum_i \alpha_i - \sum_k \gamma_k = 1.$$
In particular, the $\alpha_i$’s can be rewritten as $\alpha_i = a_i + b_i$, where $a_i, b_i \ge 0$, $\sum_i b_i = \sum_k \gamma_k$ and $\sum_i a_i = 1$. As such,
$$p = \sum_{i,\ \vec{v}_i \in V_1} a_i \vec{v}_i + \sum_{j,\ \vec{u}_j \in V_0} \beta_j \vec{u}_j + \sum_{i,\ \vec{v}_i \in V_1} b_i \vec{v}_i + \sum_{k,\ \vec{w}_k \in V_{-1}} \gamma_k \vec{w}_k.$$
Now, since $\sum_i b_i = \sum_k \gamma_k$, we have $\sum_{i,\ \vec{v}_i \in V_1} b_i \vec{v}_i + \sum_{k,\ \vec{w}_k \in V_{-1}} \gamma_k \vec{w}_k \in (x_1 = 0)$. As such, we have
$$\sum_{j,\ \vec{u}_j \in V_0} \beta_j \vec{u}_j + \sum_{i,\ \vec{v}_i \in V_1} b_i \vec{v}_i + \sum_{k,\ \vec{w}_k \in V_{-1}} \gamma_k \vec{w}_k \in \mathrm{cone}(W_0).$$
Also, $\sum_i a_i = 1$ and, for all i, $a_i \ge 0$, implying that $\sum_{i,\ \vec{v}_i \in V_1} a_i \vec{v}_i \in \mathrm{CH}(V_1)$. Thus p is a sum of two points, one of them in cone(W_0)
and the other in CH(V_1), implying that p ∈ Y and thus P ⊆ Y. We conclude that P = Y.
Theorem 10.1.12 A cone C is generated by a finite set V ⊆ IRd (that is C = cone(V)), if and only if, there exists a finite set of
halfspaces, all passing through the origin, such that their intersection is C.
In linear programming lingo, a cone C is finitely generated by V, if and only if there exists a matrix M ∈ IRm×d , such that x ∈ C
if and only if M x ≤ 0.
Proof: Let C = cone(V), and observe that a point p ∈ cone(V) can be written as (part of) a solution to a linear program. Indeed,
let $V = \bigl\{\vec{v}_1, \ldots, \vec{v}_m\bigr\}$, and consider the linear program, in the variables $t_1, \ldots, t_m$ and $x \in \mathrm{IR}^d$:
$$\sum_{i=1}^{m} t_i \vec{v}_i = x \qquad\text{and}\qquad t_i \ge 0 \ \text{ for all } i.$$
Clearly, for any x ∈ C, there are $t_1, \ldots, t_m$ such that $\sum_i t_i \vec{v}_i = x$. Thus, the projection of (the feasible region of) this LP, which has m + d variables, onto the subspace
formed by the coordinates of the d variables x = (x1, . . . , x_d), is the set C. Now, by Lemma 10.1.5, this projection is also a
H-polyhedron, implying that C is a H-polyhedron.
As for the other direction, consider the H-polyhedron P formed by the set of points (x, w) ∈ IR^{d+m} such that Mx ≤ w. By
Lemma 10.1.8, there exists V ⊆ IR^{d+m} such that cone(V) = P. Now,
$$\bigl\{ x \in \mathrm{IR}^d \;\bigm|\; Mx \le 0 \bigr\} \times \{0\} = P \cap (w_1 = 0) \cap \cdots \cap (w_m = 0).$$
A repeated application of Lemma 10.1.10 implies that the above set is a cone generated by a (finite) set of vectors, since w_i = 0 is
a hyperplane, for i = 1, . . . , m.
Theorem 10.1.13 A region P is a H-polyhedron in IRd , if and only if, there exist finite sets P, V ⊆ IRd such that P = CH(P) +
cone(V).
Proof: Consider the linear inequality $\sum_{i=1}^{d} a_i x_i \le b$, which is one of the constraints defining the polyhedron P. We can lift
this inequality into an inequality passing through the origin by introducing an extra variable. The resulting inequality in IR^{d+1} is
$\sum_{i=1}^{d} a_i x_i - b x_{d+1} \le 0$. Clearly, this inequality defines a halfspace that goes through the origin, and furthermore, its intersection with
x_{d+1} = 1 is exactly (P, 1) (i.e., the set of points of the polyhedron P concatenated with 1 as the last coordinate). Thus, by doing this
“lifting” for all the linear constraints defining P, we get a cone C with its apex at the origin, defined by the intersection of halfspaces
passing through the origin. As such, by Theorem 10.1.12, there exists a set of vectors V ⊆ IR^{d+1}, such that C = cone(V) and
C ∩ (x_{d+1} = 1) = (P, 1). Now, Lemma 10.1.11 implies that there exist P, W ⊆ IR^{d+1}, such that C ∩ (x_{d+1} = 1) = CH(P) + cone(W).
Dropping the (d + 1)st coordinate from these points implies that P = CH(P′) + cone(W′), where P′ and W′ are P and W projected onto
the first d coordinates, respectively.
As for the other direction, assume that P = CH(P) + cone(V). Let P′ = (P, 1) and V′ = (V, 0) be the liftings of P and V to d + 1
dimensions, padding each point of P with an extra coordinate equal to 1 and each vector of V with an extra coordinate equal to 0. Now, clearly,
$$(P, 1) \subseteq \mathrm{cone}(P' \cup V') \cap (x_{d+1} = 1).$$
In fact, the containment holds also in the other direction, since a point $p \in \mathrm{cone}(P' \cup V') \cap (x_{d+1} = 1)$ is made out of a convex
combination of the points of P′ (since they have 1 in the (d + 1)th coordinate) and a positive combination of the points of V′ (that
have 0 in the (d + 1)th coordinate). Thus, we have that (P, 1) = cone(P′ ∪ V′) ∩ (x_{d+1} = 1). The cone C′ = cone(P′ ∪ V′) can be
described as a finite intersection of halfspaces (all passing through the origin), by Theorem 10.1.12. Let L be the equivalent linear
program. Now, substitute x_{d+1} = 1 into this linear program. This results in a linear program over IR^d whose feasible region is P.
Namely, P is a H-polyhedron.
A polytope is the convex hull of a finite point set. Theorem 10.1.13 implies that a polytope is also formed by the intersection
of a finite set of halfspaces. Namely, a polytope is a bounded polyhedron.
A linear inequality a · x ≤ b is valid for a polytope P if it holds for all x ∈ P . A face of P is a set
F = P ∩(a · x = b) ,
where a · x ≤ b is a valid inequality for P . The dimension of F is the affine dimension of the affine space it spans. As such, a vertex
is 0 dimensional. Intuitively, vertices are the “corners” of the polytopes.
Lemma 10.1.14 A vertex p of a polyhedron P cannot be written as a convex combination of a set X of points, such that X ⊆ P and p ∉ X.
Proof: Let h be a hyperplane whose intersection with P is (only) the point p. All the points of X must lie strictly on one side of h, since no point of X lies on h itself. As such, any convex combination of the points of X lies strictly on one side of h, and in particular cannot equal p.
Claim 10.1.15 Let P be a polytope in IRd. Then P is the convex hull of its vertices; namely, P = CH(vert(P)). Furthermore, if for a set V ⊆ IRd we have P = CH(V), then vert(P) ⊆ V.
Proof: The polytope P is a bounded intersection of halfspaces. By Theorem 10.1.13, it can be represented as the sum of the convex hull of a finite point set with a cone. But the cone here is empty, since P is bounded. Thus, there exists a finite set V such that P = CH(V). Let p be a point of V. If p can be expressed as a convex combination of the points of V \ {p}, then CH(V \ {p}) = CH(V), since any point expressed as a convex combination involving p can be rewritten to exclude p. So, let X be the resulting set after we remove all such superfluous vertices. We claim that X = vert(P).
Indeed, if p ∈ X, then the following LP does not have a solution: ∑_i t_i = 1 and ∑_{i=1}^{m} t_i p_i = p, where X \ {p} = {p_1, . . . , p_m}. In matrix form, this LP is

M t = (1, p)    and    t = (t_1, . . . , t_m) ≥ 0,

where M is the (d + 1) × m matrix whose ith column is (1, p_i).
By the Farkas Lemma II (Lemma 10.1.7), since this LP is not feasible, there exists a vector w ∈ IRd+1 such that wM ≥ 0 and w · (1, p) < 0. Writing w = (α, s), we can restate these inequalities as

α + s · p_i ≥ 0, for i = 1, . . . , m,    and    α + s · p < 0.

Namely, all the points of X \ {p} lie in the halfspace s · x ≥ −α, while p lies strictly outside it. Hence the minimum of s · x over P = CH(X) is attained only at p, and p is a vertex of P.
Lemma 10.1.16 Let v be a vertex of a polytope P, let P/v = P ∩ h be its vertex figure, where h ≡ (w · x = c_1) is a hyperplane separating v from the other vertices of P, and let cone_{v,P} = v + cone(P/v − v). Then P ⊆ cone_{v,P}.
Proof: Consider a point p ∈ P. We can assume that, for the hyperplane h ≡ (w · x = c_1) defining P/v, it holds that w · v < c_1 ≤ w · p (otherwise, we can translate h so that this holds). In particular, consider the point q = vp ∩ (P/v). By the convexity of P, we have q ∈ P, and as such q ∈ P/v. Thus, q − v ∈ cone(P/v − v), and thus p − v ∈ cone(P/v − v), which implies that p ∈ v + cone(P/v − v) = cone_{v,P}.
Lemma 10.1.17 Let h be a halfspace defined by w · x ≤ c_3, such that for a vertex v of P we have w · v = c_3, and h is valid for P/v (i.e., for all x ∈ P/v we have w · x ≤ c_3). Then h is valid for P.
Proof: Consider the linear function f(x) = c_3 − w · x. It is zero at v and non-negative on P/v. As such, it is non-negative on any ray starting at v and passing through a point of P/v. As such, f(·) is non-negative on any point of cone_{v,P}, which implies, by Lemma 10.1.16, that f(·) is non-negative on P.
Lemma 10.1.18 There is a bijection between the k-dimensional faces of P that contain v and the (k − 1)-dimensional faces of P /v.
Specifically, for a face f of P , the corresponding face is
π(f) = f ∩ h,
where h ≡ w · x = c1 is the hyperplane of P /v. Similarly, for a (k − 1)-dimensional face g of P /v, the corresponding face of P is
σ(g) = affine(v, g) ∩ P ,
where affine(v, g) is the affine subspace spanned by v and the points of g.
Proof: For the sake of simplicity of exposition, assume that h ≡ (x_d = 0) and furthermore v[d] > 0, where v[d] denotes the dth coordinate of v. This can always be realized by a rigid rotation and translation of space.
A face f of P with v ∈ f is defined as P ∩ (w_2 · x = c_2), where w_2 · x ≤ c_2 is a valid inequality for P. Now,

π(f) = f ∩ h = P ∩ (w_2 · x = c_2) ∩ h = (P ∩ h) ∩ (w_2 · x = c_2) = (P/v) ∩ (w_2 · x = c_2),

where w_2 · x ≤ c_2 is a valid inequality for the polytope P/v ⊆ P. As such, π(f) is a face of P/v. Note that if f is k-dimensional and k ≥ 1, then it contains two vertices of P which are on different sides of h, and as such π(f) is not empty.
Let g be a face of P/v, defined as g = (P/v) ∩ (w_3 · x = c_3), where w_3 · x ≤ c_3 is an inequality valid for P/v. Note that by setting w_3[d] to be a sufficiently large negative number, we can guarantee that w_3 · v < c_3.
For λ ≥ 0, consider the convex combination of the two inequalities w_3 · x ≤ c_3 and x_d ≤ 0; that is,

h(λ) ≡ w_3 · x + λ x_d ≤ c_3.

Geometrically, as we increase λ ≥ 0, the halfspace h(λ) is formed by a hyperplane rotating around the affine subspace s formed by the intersection of the hyperplanes w_3 · x = c_3 (for λ = 0) and x_d = 0 (for λ = ∞); as such, g ⊆ s, although we do not need this fact anywhere. Since the two original inequalities are valid for P/v, it follows that h(λ) is valid for P/v, for any λ ≥ 0. On the other hand, v ∈ h(0) and v ∉ h(∞). It follows that there is a value λ_0 of λ such that v lies on the hyperplane bounding h(λ_0). Since h(λ_0) is valid for P/v, it follows, by Lemma 10.1.17, that h(λ_0) is valid for P. As such, f = h(λ_0) ∩ P is a face of P that contains both v and g. As such, affine(v, g) ∩ P ⊆ f. It remains to show equality. So consider a point p ∈ f; as before, we can assume that p[d] < 0. As such, r = pv ∩ (x_d = 0) is a point on the boundary of P/v, and furthermore r ∈ g, since h(λ_0) and w_3 · x ≤ c_3 have the same intersection with the hyperplane x_d = 0 (i.e., the boundary of this intersection is s). This implies that p ∈ affine(v, g), and as such f = affine(v, g) ∩ P.
As such, the maps π and σ are well defined. It remains to verify that they are inverses of each other. Indeed, for a face g of P/v we have

π(σ(g)) = π(affine(v, g) ∩ P) = (affine(v, g) ∩ P) ∩ h = affine(g) ∩ (P ∩ h) = affine(g) ∩ (P/v) = g,

since v ∉ h and g ⊆ h. Similarly, for a face f of P that contains v, we have

σ(π(f)) = σ(f ∩ h) = affine(v, f ∩ h) ∩ P = affine(f) ∩ P = f,

since affine(f) can be written as the affine subspace spanned by v together with a set of points in f ∩ h.
Lemma 10.2.1 Let v be a vertex of a H-polyhedron P in IRd , and let f (·) be a linear function, such that f (·) is non-decreasing
along all the edges leaving v (i.e., v is a “local” minimum of f (·) along the edges adjacent to v), then v realizes the global minimum
of f (·) on P .
Proof: Assume, for the sake of contradiction, that this is false, and let x be a point in P such that f(x) < f(v). Let P/v = P ∩ h, where h is a hyperplane. By convexity, the segment xv must intersect P/v; let y be this intersection point. Since P/v is a convex polytope in d − 1 dimensions, y can be written as a convex combination of d of its vertices u_1, . . . , u_d; namely, y = ∑_i α_i u_i, where α_i ≥ 0 and ∑_i α_i = 1. By Lemma 10.1.18, each one of these vertices lies on an edge of P adjacent to v, and by the local optimality of v, we have f(u_i) ≥ f(v). Now, by the linearity of f(·), we have

f(y) = f( ∑_i α_i u_i ) = ∑_i α_i f(u_i) ≥ f(v).

But y is a convex combination of x and v, and f(x) < f(v). As such, f(y) < f(v). A contradiction.
10.3 Garbage
In the following, assume that the polytope (i.e., feasible region) is bounded.
Lemma 10.3.1 Let P be a bounded polytope, and let V be the set of vertices of P . Then P = CH(V).
Consider the intersection of d − 1 hyperplanes of the LP with P. Clearly, this is either empty or a segment connecting two vertices of P. If the intersection is not empty, we will refer to it as an edge connecting the two vertices forming its endpoints. Consider the graph G = G(V, E) formed by this set of edges. The target function assigns each vertex a value. By the general position assumption, we can assume that no two vertices are assigned the same target value. A vertex is a sink if its target value is lower than that of all its neighbors. Assume, for now, that G contains a single sink and that it is connected.
Start a traversal of G at an arbitrary vertex v, and repeatedly move to any of its neighbors that has a lower target value. This walk stops once we arrive at the sink, which is the required optimal solution to the LP. This is essentially the simplex algorithm for Linear Programming; a short sketch of this walk is given below.
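To make the walk concrete, here is a minimal Python sketch of the local-improvement traversal on an abstract vertex graph; the adjacency lists and target values are assumed to be given (extracting them from the LP itself is not shown), so this illustrates only the walk described above, not a full simplex implementation.

    def walk_to_sink(neighbors, target, start):
        """Repeatedly move to any neighbor with a smaller target value,
        until no neighbor improves; the final vertex is a sink.

        neighbors: dict mapping a vertex to the list of its neighbors in G.
        target:    dict mapping a vertex to its value under the target function.
        start:     an arbitrary starting vertex.
        """
        v = start
        while True:
            better = [u for u in neighbors[v] if target[u] < target[v]]
            if not better:
                return v          # v is a sink
            v = better[0]         # any improving neighbor works; pivot rules differ here

    # Toy example: the unit square, minimizing x + y over its four vertices.
    verts = {"00": (0, 0), "10": (1, 0), "01": (0, 1), "11": (1, 1)}
    adj = {"00": ["10", "01"], "10": ["00", "11"], "01": ["00", "11"], "11": ["10", "01"]}
    val = {k: x + y for k, (x, y) in verts.items()}
    assert walk_to_sink(adj, val, "11") == "00"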
To this end, we need to prove that G is connected and has a single sink.
Chapter 11
“Napoleon has not been conquered by man. He was greater than all of us. But god punished him because he relied
on his own intelligence alone, until that prodigious instrument was strained to breaking point. Everything breaks in
the end.”
– Carl XIV Johan, King of Sweden
11.1 Introduction
Let P be a set of n points in IRd. We would like to preprocess it, such that given a query point q, one can quickly determine the closest point in P to q. Unfortunately, the exact problem seems to require prohibitive preprocessing time. (Namely, computing the Voronoi diagram of P, and preprocessing it for point-location queries. This requires (roughly) O(n^⌈d/2⌉) time.)
Instead, we will specify a parameter ε > 0, and build a data-structure that answers (1 + ε)-approximate nearest neighbor queries.
This is yet another instance where solving the bounded spread case is relatively easy.
The set Ai is a set of nodes of depth i in the quadtree that the algorithm visits. Note, all these nodes belong to the canonical
grid G2−i of level −i, where every canonical square has sidelength 2−i . (Thus, nodes of depth i in the quadtree are of level −i. This
is somewhat confusing but it in fact makes the presentation simpler.)
Correctness. Note that the algorithm adds a node w to Ai only if the set Pw might contain points which are closer to q than the
(best) current nearest neighbor the algorithm found, where Pw is the set of points stored in the subtree of w. (More precisely, Pw
might contain a point which is (1 − ε/2) closer to q than any point encountered so far.)
Consider the last node w inspected by the algorithm such that q̂ ∈ P_w, where q̂ denotes the nearest neighbor of q in P. Since the algorithm decided to throw this node away, we have, by the triangle inequality, that

‖q − q̂‖ ≥ ‖q − rep_w‖ − diam(□_w) ≥ (1 − ε/2) r_curr.

Thus, ‖q − q̂‖/(1 − ε/2) ≥ r_curr. However, 1/(1 − ε/2) ≤ 1 + ε, for 1 ≥ ε > 0, as can be easily verified. Thus, r_curr ≤ (1 + ε) d_P(q), and the algorithm returns a (1 + ε)-ANN to q.
Running time analysis. Before barging into a formal proof of the running time of the above search procedure, it is useful to visualize the execution of the algorithm. It visits the quadtree level by level. As long as the grid cells of the level are bigger than the ANN distance r = d_P(q), the number of nodes visited is a constant (i.e., |A_i| = O(1)). This number “explodes” only when the cell size becomes smaller than r, but then the search stops once we reach grid size O(εr). In particular, since the number of grid cells visited (in the second stage) grows exponentially with the level, we can use the number of nodes visited at the bottom level (i.e., O(1/ε^d)) to bound the query running time for this part of the query. (The accompanying figure illustrates this: the explosion starts at cell size ≈ d_P(q) and ends at cell size ≈ ε d_P(q), where O(1/ε^d) cells are visited.)
Lemma 11.2.1 Let P be a set of n points contained inside the unit hypercube in IRd, and let T be a quadtree of P, where diam(P) = Ω(1). Let q be a query point, and let ε > 0 be a parameter. A (1 + ε)-ANN to q can be computed in O(ε^{−d} + log(1/$)) time, where $ = ‖q − q̂‖.
Proof: The algorithm is described above. We are only left with the task of bounding the query time. Observe that if a node w ∈ T is considered by the algorithm and diam(□_w) < (ε/4)$, then

‖q − rep_w‖ − diam(□_w) ≥ ‖q − rep_w‖ − (ε/4)$ ≥ r_curr − (ε/4) r_curr ≥ (1 − ε/4) r_curr,

which implies that neither w nor any of its children would be inserted into the sets A_1, . . . , A_m, where m is the depth of T, by Eq. (11.1). Thus, no node of depth ≥ h = ⌈− lg($ε/4)⌉ is considered by the algorithm.
Consider the node u of T of depth i containing q̂. Clearly, the distance between q and rep_u is at most ℓ_i = $ + diam(□_u) = $ + √d · 2^{−i}. As such, at the end of the ith iteration we have r_curr ≤ ℓ_i, since the algorithm has inspected u. Thus, the only cells of G_{2^{−i−1}} that might be considered by the algorithm are the ones in distance ≤ ℓ_i from q. The number of such cells is

n_i = O( ⌈ ℓ_i / 2^{−i−1} ⌉ )^d = O( 1 + ($ + √d · 2^{−i}) / 2^{−i−1} )^d = O( 1 + (2^i $)^d ),

since for any a, b ≥ 0 we have (a + b)^d ≤ (2 max(a, b))^d ≤ 2^d (a^d + b^d). Thus, the total number of nodes visited is

∑_{i=0}^{h} n_i = ∑_{i=0}^{⌈− lg($ε/4)⌉} O( 1 + (2^i $)^d ) = O( lg(1/($ε)) + ( $ / ($ε/4) )^d ) = O( log(1/$) + 1/ε^d ).
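To make the search procedure concrete, the following is a minimal Python sketch of the level-by-level traversal with the pruning rule used above; the quadtree node interface (rep, diam, children) is an assumption made for this sketch, and building the quadtree itself is not shown.

    import math

    def ann_query(root, q, eps):
        """(1+eps)-ANN search, processing quadtree nodes level by level.
        Each node w is assumed to provide: w.rep (a point of P stored in its
        subtree), w.diam (the diameter of its cell), and w.children (a list)."""
        best, r_curr = root.rep, math.dist(q, root.rep)
        level = [root]                              # the set A_0
        while level:
            next_level = []                         # the set A_{i+1}
            for w in level:
                d = math.dist(q, w.rep)
                if d < r_curr:
                    best, r_curr = w.rep, d
                # Keep w's children only if its subtree might still contain a
                # point noticeably closer than the current candidate.
                if d - w.diam < (1 - eps / 2.0) * r_curr:
                    next_level.extend(w.children)
            level = next_level
        return best, r_curr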
A less trivial task is to adapt the algorithm so that it uses compressed quadtrees. To this end, the algorithm would still handle the nodes by levels. This requires us to keep a heap of integers in the range 0, −1, . . . , −⌈lg Φ(P)⌉. This can be easily done by maintaining an array of size O(log Φ(P)), where each array cell maintains a linked list of all the nodes of that level. Clearly, an insertion/deletion into this heap data-structure can be handled in constant time by augmenting it with a hash table. Thus, the above algorithm works in this case after modifying it to use this “level” heap instead of the sets A_i.
Theorem 11.2.3 Let P be a set of n points in IRd. One can preprocess P in O(n log n) time, using linear space, such that given a query point q and a parameter 1 ≥ ε > 0, one can return a (1 + ε)-ANN to q in O(1/ε^d + log Φ(P)) time. In fact, the query time is O(1/ε^d + log(diam(P)/$)), where $ = d_P(q).
Plan of attack. To answer an ANN query in the general case, we will first get a fast rough approximation. Next, using a compressed quadtree, we will find a constant number of relevant nodes, and apply Theorem 11.2.3 to those nodes. This yields the required approximation. Before solving this problem, we need a minor extension of the compressed quadtree data-structure.
Lemma 11.3.1 One can perform a cell query in a compressed quadtree T̂ in O(log n) time, where n is the size of T̂. Namely, given a query canonical cell □̂, one can find, in O(log n) time, the node w ∈ T̂ such that □_w ⊆ □̂ and P ∩ □̂ = P_w.
Lemma 11.3.2 Let P be a set of n points in IRd . One can build a data structure TR , in O(n log n) time, such that given a query point
q ∈ IRd , one can return a (1 + 4n)-ANN of q in P in O(log n) time.
Given a query point q, using T_R, we compute a point u ∈ P such that $ ≤ ‖u − q‖ ≤ (1 + 4n)$, where $ = d_P(q). Let R = ‖u − q‖ and r = ‖u − q‖/(4n + 1). Clearly, r ≤ $ ≤ R. Next, compute ℓ = ⌈lg R⌉, and let C be the set of cells of G_{2^ℓ} that are in distance ≤ R from q. Clearly, since R ≤ 2^ℓ, it follows that q̂ ∈ ∪_{□∈C} □, where q̂ is the nearest neighbor to q in P. For each cell □ ∈ C, we compute the node v ∈ T̂ such that P ∩ □ = P_v, using a cell query (i.e., Lemma 11.3.1). Let V be the resulting set of nodes of T̂.
For each node v ∈ V, we now apply the algorithm of Theorem 11.2.3 to the compressed quadtree rooted at v. Since |V| = O(1) and diam(P_v) = O(R) for all v ∈ V, the query time is

O( ∑_{v∈V} ( 1/ε^d + log(diam(P_v)/r) ) ) = O( 1/ε^d + ∑_{v∈V} log(diam(P_v)/r) ) = O( 1/ε^d + ∑_{v∈V} log(R/r) ) = O( 1/ε^d + log n ).
As for the correctness of the algorithm, notice that there is a node w ∈ V such that q̂ ∈ P_w. As such, when we apply the algorithm of Theorem 11.2.3 to w, it returns a (1 + ε)-ANN to q.
Theorem 11.3.3 Let P be a set of n points in IRd . One can construct a data-structure of linear size, in O(n log n) time, such that
given a query point q ∈ IRd , and a parameter 1 ≥ ε > 0, one can compute a (1 + ε)-ANN to q in O(1/εd + log n) time.
11.4.1 Low Quality ANN Search - Point Location with Random Shifting
11.4.1.1 The data-structure and search procedure
Let P be a set of n points in IRd, contained inside the square [0, 1/2]^d. Let v⃗ be a random vector in the square [0, 1/2]^d, and consider the point set Q = P + v⃗ = { p + v⃗ | p ∈ P }. Clearly, given a query point q, we can answer an ANN query on P by answering the ANN query q + v⃗ on Q. Note that Q is contained inside the unit cube; consider the compressed quadtree T̂ built for Q.
Given a query point q, let v be the node of T̂ whose region rg_v contains q′ = q + v⃗. If rg_v is a cube (i.e., v is a leaf) and v stores a point p ∈ Q inside it, then we return ‖q′ − p‖ as the distance to the ANN. If v does not store a point of Q, then we return 2·diam(rg_v) as the distance to the ANN.
Things get more exciting if v is a compressed node. In this case, there is no point of Q associated with v (by construction). Let w be its only child, and return d(rg_w, q′) + diam(rg_v) as the distance to the ANN.
Lemma 11.4.1 Let I = [α, β] be an interval, let r = 2^{−ℓ} for some integer ℓ ≥ 1, and let U = { kr | k is an integer }. For a random number x picked uniformly from [0, 1/2], the probability that the shifted interval I + x contains a point of U is at most ‖I‖/r, where ‖I‖ = β − α.
Proof: If ‖I‖ ≥ r, then any translation of I contains a point of U, and the claim trivially holds.
Otherwise, let X be the set of real numbers such that if x ∈ X, then I + x contains a point of U. The set X is a periodic set of intervals; formally, X = ∪_k ([−β, −α] + kr). As such, the total length of X inside the interval [0, 1/2] is ρ = (‖I‖/r) · ‖[0, 1/2]‖, since r is a power of 2 and thus 1/2 is a multiple of r. As such, the probability that a random point inside [0, 1/2] falls inside X is ρ/‖[0, 1/2]‖ = ‖I‖/r.
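As a quick sanity check of this bound (not part of the construction), the following Python snippet estimates the probability empirically for a concrete interval and grid size; the specific numbers used are arbitrary examples.

    import math
    import random

    def hit_probability(alpha, beta, r, trials=100000):
        """Estimate Pr[ the shifted interval [alpha, beta] + x contains a
        multiple of r ], for x uniform in [0, 1/2]."""
        hits = 0
        for _ in range(trials):
            x = random.uniform(0.0, 0.5)
            lo, hi = alpha + x, beta + x
            # [lo, hi] contains a multiple of r iff the largest multiple of r
            # that is <= hi is still >= lo.
            if math.floor(hi / r) * r >= lo:
                hits += 1
        return hits / trials

    # For an interval of length 0.03 and r = 1/8, the estimate should be
    # roughly 0.03 / 0.125 = 0.24, matching the bound of the lemma.
    print(hit_probability(0.1, 0.13, 0.125))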
Lemma 11.4.2 Let s be a segment of length ∆. Consider the randomly shifted segment s + v⃗. The probability that s + v⃗ intersects the boundary of the canonical grid G_r is at most d∆/r, where r = 2^{−ℓ} and ℓ ≥ 1.
Proof: Let I_i = [α_i, β_i] be the projection of s onto the ith axis, for i = 1, . . . , d. Clearly, s + v⃗ intersects one of the separating hyperplanes orthogonal to the ith dimension if and only if the (randomly shifted) interval I_i contains a point of U = { kr | k is an integer }. By Lemma 11.4.1, the probability for that is at most δ_i/r, where δ_i = ‖I_i‖. Thus, the probability that s + v⃗ intersects the boundary of G_r is bounded by

∑_{i=1}^{d} δ_i/r ≤ d∆/r,

as claimed.
Lemma 11.4.3 For any integer i, and a query point q, the above data-structure returns a 4n^i approximation to the distance to the nearest neighbor of q in P, with probability ≥ 1 − 2/n^{i−1}.
Proof: For the time being, it is easier to consider T̂ to be a regular (i.e., not compressed) quadtree of the point set Q. Let v be the leaf of T̂ that contains the query point q′ = q + v⃗. Let p = q̂′ be the nearest neighbor to q′ in Q, and consider the segment s = q̂ q. Clearly, s′ = s + v⃗ = q′p.
Now, if s′ is completely contained in the leaf v of T̂ that contains q′, then p is stored in v, and we return ‖s′‖ as the distance to the ANN.
If s′ intersects the boundary of the leaf v, let r be the side length of rg_v. Observe that r ≥ ‖s‖/√d. Indeed, if r ≤ ‖s‖/√d, then rg_v ⊆ b(q′, ‖s‖), but the interior of b(q′, ‖s‖) is empty of any point of Q; namely, the region of v would not have been further refined by the construction algorithm. As such, the result returned by the algorithm is never too small; it can only be too large.
In particular, if the side length of the leaf that contains q′ is r, then s′ intersects the boundary of the grid G_r, and the probability for that to happen is at most d$/r, by Lemma 11.4.2. Thus, let x be the smallest power of two which is larger than n^i $. We have that

Pr[ dist. returned ≥ 2√d · n^i $ ] ≤ Pr[ r ≥ n^i $ ] ≤ ∑_{j=0}^{∞} d$/(2^j x) ≤ ∑_{j=0}^{∞} d$/(2^j n^i $) = 2d/n^i ≤ 2/n^{i−1}.
Namely, if T is a t-ring tree, then for any node v ∈ T, the interior of the ring b(pv , (1 + t)rv ) \ b(pv , rv ) is empty of any point of
P. Intuitively, the bigger t is, the better T clusters P.
The ANN search procedure. Let q denote the query point. Initially, set v to be the root of T, and r_curr ← ∞. The algorithm answers the ANN query by traversing down T.
During the traversal, we first compute the distance l = ‖q − rep_v‖. If this is smaller than r_curr (the distance to the current nearest neighbor found), then we update r_curr (and store the point realizing the new value of r_curr).
If ‖q − p_v‖ ≤ r̂, where r̂ = (1 + t/2)r_v is the “middle” radius of the ring, we continue the search recursively in the child containing P_v^in. Otherwise, we continue the search in the subtree containing P_v^out. The algorithm stops when reaching a leaf of T, and returns the point realizing r_curr as the approximate nearest neighbor.
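A minimal Python sketch of this traversal follows; the node fields (rep, p, r, t, in_child, out_child) are assumptions matching the notation above, with leaves marked by having no children.

    import math

    def ring_tree_query(root, q):
        """Approximate nearest neighbor query on a t-ring tree.
        Each internal node v is assumed to store: v.rep (its representative
        point), v.p and v.r (center and inner radius of its ring), v.t (the
        ring parameter), and the children v.in_child and v.out_child."""
        best, r_curr = None, float("inf")
        v = root
        while v is not None:
            d = math.dist(q, v.rep)
            if d < r_curr:
                best, r_curr = v.rep, d
            if getattr(v, "in_child", None) is None:     # reached a leaf of T
                break
            middle = (1.0 + v.t / 2.0) * v.r              # the "middle" radius
            v = v.in_child if math.dist(q, v.p) <= middle else v.out_child
        return best, r_curr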
Intuition. If the query point q is outside the outer ball of a node v, then it is so far from the points inside the inner ball (i.e., P_v^in) that we can treat all of them as a single point (i.e., rep_v). On the other hand, if the query point q′ is inside the inner ball, then it must have a neighbor nearby (i.e., a point of P_v^in), and all the points of P_v^out are far enough away that they can be ignored. Naturally, if the query point falls inside the ring, the same argumentation works (with slightly worse constants), using the middle radius r̂ as the splitting boundary in the search. See the figure on the right.
Lemma 11.4.5 Given a t-ring tree T, one can answer (1 + 4/t)-approximate nearest neighbor queries in O(depth(T)) time.
Proof: Clearly, the query time is O(depth(T)). As for the quality of approximation, let π denote the generated search path in T, and let q̂ denote the nearest neighbor to q in P. Furthermore, let w denote the last node in the search path π such that q̂ ∈ P_w. Clearly, if q̂ ∈ P_w^in but we continued the search in P_w^out, then q is outside the middle sphere, and ‖q − q̂‖ ≥ (t/2) r_w (since this is the distance between the middle sphere and the inner sphere). Thus,

‖q − rep_w‖ ≤ ‖q − q̂‖ + ‖q̂ − rep_w‖ ≤ ‖q − q̂‖ + 2 r_w,

since q̂, rep_w ∈ b_w = b(p_w, r_w). In particular,

‖q − rep_w‖ / ‖q − q̂‖ ≤ ( ‖q − q̂‖ + 2 r_w ) / ‖q − q̂‖ ≤ 1 + 4/t.
In low dimensions, there is always a good separating ring. Indeed, consider the smallest ball b = b(p, r) that contains n/c_1 points of P, where c_1 is a sufficiently large constant. Let b′ be the scaling of this ball by a factor of two. By a standard packing argument, the ring b′ \ b can be covered with c = O(1) copies of b, none of which can contain more than n/c_1 points of P. It follows that by picking c_1 = 3c, we are guaranteed that at least half the points of P are outside b′. Now, the ring can be split into n/2 empty rings (by taking a sphere that passes through each point inside the ring), and one of them would be of thickness at least r/n, and it would separate n/c_1 points of P from the outer n/2 points of P. Doing this efficiently requires trading off some constants, and some tedious details, as described in the following lemma.
Lemma 11.4.6 Given a set P of n points in IRd, one can compute a (1/n)-ring tree of P in O(n log n) time.
Proof: The construction is recursive. Compute a ball D = b(p, r) that contains ≥ n/c points of P, such that r ≤ 2 r_opt(P, n/c), where c is a constant to be determined shortly. We remind the reader that r_opt(P, n/c) is the radius of the smallest ball that contains n/c points of P, and that the ball D can be computed in linear time, by Lemma 1.3.1. Consider the ball D′ of radius 2r centered at p.
The ball D′ can be covered by a hypercube S with side length 4r. Furthermore, partition S into a grid such that every cell has side length r/L, for L = ⌈16√d⌉. Every cell in this grid has diameter ≤ 4r√d/L ≤ r/4 ≤ r_opt(P, n/c)/2. Thus, every grid cell can contain at most n/c points, since it can be covered with a ball of radius r_opt(P, n/c)/4. There are M = (4r/(r/L))^d = (4L)^d grid cells. Thus, D′ contains at most (4L)^d (n/c) points. Specifically, for c = 2(4L)^d, the ball D′ contains at most n/2 points of P.
In particular, there must be a radius r′, with r ≤ r′ ≤ 2r, and an h ≥ r/n, such that the ring b(p, r′ + h) \ b(p, r′) does not contain any point of P in its interior.
Indeed, sort the points of P inside D′ \ D by their distances from p. There are at most n/2 such numbers in the range of distances [r, 2r]. As such, there must be an empty interval of length ≥ r/(n/2 + 1), and this empty range corresponds to the empty ring.
Computing r′ and h is done by computing the distance of each point from p, and partitioning the distance range [r, 2r] into 2n equal length segments. In each segment, we register the point with minimum and the point with maximum distance from p in this range. This can be done in linear time using the floor function. Next, scan these buckets from left to right. Observe that the maximum length gap is realized by the maximum of one bucket, together with a consecutive sequence of empty buckets, ending at the minimum of a non-empty bucket. As such, the maximum length interval can be computed in linear time, yielding r′ and h.
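The bucketing step just described can be sketched as follows in Python; the sketch assumes that the distances from p of the points of D′ \ D have already been computed and all lie in [r, 2r], and it returns the left endpoint and the length of the largest empty gap.

    def largest_empty_gap(dists, r, n):
        """Scan 2n equal-length buckets over [r, 2r] and return (r_prime, h),
        the start and length of the largest sub-interval of [r, 2r] that
        contains no value of dists in its interior."""
        m = 2 * n
        width = r / m
        lo = [None] * m                      # minimum distance in each bucket
        hi = [None] * m                      # maximum distance in each bucket
        for d in dists:
            j = min(int((d - r) / width), m - 1)
            lo[j] = d if lo[j] is None else min(lo[j], d)
            hi[j] = d if hi[j] is None else max(hi[j], d)
        best_start, best_len, prev_max = r, 0.0, r
        for j in range(m):
            if lo[j] is None:                # empty bucket, the gap keeps growing
                continue
            if lo[j] - prev_max > best_len:
                best_start, best_len = prev_max, lo[j] - prev_max
            prev_max = hi[j]
        if 2 * r - prev_max > best_len:      # gap ending at the outer radius 2r
            best_start, best_len = prev_max, 2 * r - prev_max
        return best_start, best_len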
Thus, let v be the root of the new tree, set P_v^in to be P ∩ b(p, r′) and P_v^out = P \ P_v^in, and store b_v = b(p, r′) and p_v = p. Continue the construction recursively on those two sets. Observe that |P_v^in|, |P_v^out| ≥ n/c, where c is a constant. It follows that the construction time of the algorithm is T(n) = O(n) + T(|P_v^in|) + T(|P_v^out|) = O(n log n), and the depth of the resulting tree is O(log n).
time of the algorithm is T (n) = O(n) + T Pvin + T Pvout = O(n log n), and the depth of the resulting tree is O(log n).
Combining the above two lemmas, we get the following result.
Theorem 11.4.7 Let P be a set of n points in IRd . One can preprocess it in O(n log n) time, such that given a query point q ∈ IRd ,
one can return a (1 + 4n)-ANN of q in P in O(log n) time.
Linear space. In low dimensions, the seminal work of Arya et al. [AMN+ 98], mentioned above, was the first to offer a linear size data-structure with logarithmic query time, such that the approximation quality is specified with the query. The query time of Arya et al. is slightly worse than the running time of Theorem 11.3.3, since they maintain a heap of cells, always handling the cell closest to the query point. This results in query time O(ε^{−d} log n). It can be further improved to O(ε^{−d} log(1/ε) + log n) by observing that this heap has only very few delete-mins, and many insertions. This observation is due to Duncan [Dun99].
Instead of having a separate ring-tree, Arya et al. rebalance the compressed quadtree directly. This results in nodes, which
correspond to cells that have the shape of an annulus (i.e., the region formed by the difference between two canonical grid cells).
Duncan [Dun99] and some other authors offered data-structures (called BAR-trees) with similar query time, but they seem to be inferior, in practice, to the work of Arya et al., for the reason that while the regions the nodes correspond to are convex, they have higher descriptive complexity, and it is harder to compute the distance of the query point to a cell.
Faster query time. One can improve the query time if one is willing to specify ε during the construction of the data-structure, resulting in a trade-off between space and query time. In particular, Clarkson [Cla94] showed that one can construct a data-structure of (roughly) size O(n/ε^{(d−1)/2}), with query time O(ε^{−(d−1)/2} log n). Chan simplified and cleaned up this result [Cha98], and also presented some other results.
Details on Faster Query Time. A set of points Q is √ε-far from a query point q if ‖q − c_Q‖ ≥ diam(Q)/√ε, where c_Q is some point of Q. It is easy to verify that if we partition space around c_Q into cones with central angle O(√ε) (this requires O(1/ε^{(d−1)/2}) cones), then the most extreme point of Q in such a cone ψ, furthest away from c_Q, is a (1 + ε)-approximate nearest neighbor for any query point inside ψ which is √ε-far. Namely, we precompute the ANN inside each cone, if the query point is far enough. Furthermore, by careful implementation (i.e., a grid in the space of angles), we can decide, in constant time, which cone the query point lies in. Thus, using O(1/ε^{(d−1)/2}) space, we can answer (1 + ε)-ANN queries for q, if the query point is √ε-far, in constant time.
Next, construct this data-structure for every set P_v, for v ∈ T̂(P). This results in a data-structure of size O(n/ε^{(d−1)/2}). Given a query point q, we use the algorithm of Theorem 11.3.3, and stop as soon as, for a node v, P_v is √ε-far, and then we use the secondary data-structure for P_v. It is easy to verify that the algorithm stops as soon as diam(□_v) = O(√ε · d_P(q)). As such, the number of nodes visited is O(log n + 1/ε^{d/2}), and the query time is identical.
Note that we omitted the construction time (which requires some additional work to be done efficiently), and our query time is slightly worse than the best known. The interested reader can check out the work by Chan [Cha98], which is somewhat more complicated than what is outlined here.
The first to achieve O(log(n/ε)) query time (using near linear space) was Har-Peled [Har01b], using space roughly O(nε^{−d} log² n). This was later simplified and improved by Arya and Malamatos [AM02], who presented a data-structure with the same query time and of size O(n/ε^d). These data-structures rely on the notion of computing approximate Voronoi diagrams and performing point location queries in those diagrams. By extending the notion of approximate Voronoi diagrams, Arya, Mount and Malamatos [AMM02] showed that one can answer (1 + ε)-ANN queries in O(log(n/ε)) time, using O(n/ε^{d−1}) space. On the other end of the spectrum, they showed that one can construct a data-structure of size O(n) with query time O(log n + 1/ε^{(d−1)/2}) (note that for this data-structure ε > 0 has to be specified in advance). In particular, the latter result breaks a space/query time tradeoff that all the other results suffer from (i.e., the query time multiplied by the construction time has dependency of 1/ε^d on ε).
Practical considerations. Arya et al. [AMN+ 98] implemented their algorithm. For most inputs, it is essentially a kd-tree. The code of their library was carefully optimized and is very efficient. In particular, in practice, I would expect it to beat most of the algorithms mentioned above. The code of their implementation is available online as a library [AM98].
Higher dimensions. All our results have exponential dependency on the dimension, in query and preprocessing time (although the space can probably be made subexponential with careful implementation). Getting subexponential algorithms requires completely different techniques, and is discussed in detail at some other point.
Stronger computation models. If one assumes that the points have integer coordinates in the range [1, U], then approximate nearest-neighbor queries can be answered in (roughly) O(log log U + 1/ε^d) time [AEIS99], or even O(log log(U/ε)) time [Har01b]. The algorithm of Har-Peled [Har01b] relies on computing a compressed quadtree of height O(log(U/ε)), and performing fast point-location queries in it. This only requires using the floor function and hashing (note that the algorithm of Theorem 11.3.3 uses the floor function and hashing during the construction, but they are not used during the query). In fact, if one is allowed to slightly blow up the space (by a factor U^δ, where δ > 0 is an arbitrary constant), the ANN query time can be improved to constant [HM04].
By shifting quadtrees, and creating d + 1 quadtrees, one can argue that the approximate nearest neighbor must lie in the same cell (of the “right” size) as the query point in one of those quadtrees. Next, one can map the points to real numbers, by using the natural space filling curve associated with each quadtree. This results in d + 1 lists of points. One can argue that a constant factor approximate nearest neighbor must be adjacent to the query point in one of those lists. This can later be improved into a (1 + ε)-ANN by inspecting O(1/ε^d) points. This simple algorithm is due to Chan [Cha02].
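As an illustration of the space-filling-curve idea (and not of the algorithm used in this chapter), here is a short Python sketch of comparing two points with non-negative integer coordinates by their Z-order (Morton order) without explicitly interleaving bits; sorting each shifted copy of the point set with this comparator and checking a few list neighbors of the query is the essence of the scheme sketched above.

    from functools import cmp_to_key

    def less_msb(x, y):
        # True if the most significant set bit of x is strictly below that of y.
        return x < y and x < (x ^ y)

    def cmp_zorder(a, b):
        """Compare points a, b (tuples of non-negative ints) by Morton/Z-order."""
        dim_of_max, x = 0, 0
        for k in range(len(a)):
            y = a[k] ^ b[k]
            if less_msb(x, y):
                dim_of_max, x = k, y
        return a[dim_of_max] - b[dim_of_max]

    # Example: sort a small 2D point set along the Z-order curve.
    pts = [(3, 5), (1, 1), (4, 0), (2, 7)]
    pts.sort(key=cmp_to_key(cmp_zorder))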
The reader might wonder why we bothered with a considerably more involved algorithm. There are several reasons: (i) This
algorithm requires the numbers to be integers of limited length (i.e., O(log U) bits), and (ii) it requires shuffling of bits on those
integers (i.e., for computing the inverse of the space filling curve) in constant time, and (iii) the assumption is that one can combine
d such integers into a single integer and perform XOR on their bits in constant time. The last two assumptions are not reasonable
when the input is made out of floating point numbers.
Further research. At least (and only) in low dimensions, the ANN problem seems to be essentially solved both in theory and practice (such proclamations are inherently dangerous, and should be taken with a considerable amount of healthy skepticism). Indeed, for ε > 1/log^{1/d} n, the current data-structure of Theorem 11.3.3 provides logarithmic query time. Thus, ε has to be quite small for the query time to become bad enough that one would wish to speed it up.
Main directions for further research seem to be working on this problem in higher dimensions, and solving it in other computation models.
Surveys. A survey on approximate nearest neighbor search in high dimensions is by Indyk [Ind04]. In low dimensions, there is
a survey by Arya and Mount [AM04].
11.6 Exercises
Exercise 11.6.1 (Better Ring Tree) [10 Points]
Let P be a set of n points in IRd. Show how to build a ring tree, of linear size, that can answer O(log n)-ANN queries in O(log n) time. [Hint: Show that there is always a ring containing O(n/log n) points, such that it is of width w and its inner radius is O(w log n). Next, build a ring tree, replicating the points in both children of such a ring node. Argue that the size of the resulting tree is linear, and prove the claimed bounds on the query time and the quality of approximation.]
Chapter 12
Today I know that everything watches, that nothing goes unseen, and that even wallpaper has a better memory than ours. It isn't God in His heaven that sees all. A kitchen chair, a coat-hanger, a half-filled ashtray, or the wood replica of a woman named Niobe, can perfectly well serve as an unforgetting witness to every one of our acts.
– The Tin Drum, Günter Grass
Definition 12.1.1 A metric space M is a pair (X, d) where X is a set and d : X × X → [0, ∞) is a metric, satisfying the following
axioms: (i) d(x, y) = 0 iff x = y, (ii) d(x, y) = d(y, x), and (iii) d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality).
For example, IR2 with the regular Euclidean distance is a metric space.
In the following, we are going to assume that we are provided with a black box, such that given two points x, y ∈ X, we can compute the distance d(x, y) in constant time.
Definition 12.1.2 Let P be a set of elements, and T a tree having the elements for P as leaves. The tree T defines a hierarchically
well-separated tree (HST) over the points of P, if to each vertex u ∈ T there is associated a label ∆u ≥ 0, such that ∆u = 0 if and
only if u is a leaf of T . The labels are such that if a vertex u is a child of a vertex v then ∆u ≤ ∆v . The distance between two leaves
x, y ∈ T is defined as ∆lca(x,y) , where lca(x, y) is the least common ancestor of x and y in T .
If every internal node of T has exactly two children, we will refer to it as being a binary HST (BHST).
For convenience, we will assume that the underlying tree is binary (any HST can be converted to binary HST in linear time,
while retaining the underlying distances).
We will also associate with every vertex u ∈ T an arbitrary leaf rep_u of the subtree rooted at u. We also require that rep_u ∈ { rep_v | v is a child of u }.
A metric N is said to t-approximate the metric M, if they are on the same set of points, and dM (u, v) ≤ dN (u, v) ≤ t · dM (u, v),
for any u, v ∈ M.
It is not hard to see that any n-point metric is (n − 1)-approximated by some HST.
Lemma 12.1.3 Given a weighted connected graph G on n vertices and m edges, it is possible to construct, in O(n log n + m) time,
a binary HST H that (n − 1)-approximates the shortest path metric on G.
Proof: Compute the minimum spanning tree of G in O(n log n + m) time, and let T denote this tree.
Sort the edges of T in non-decreasing order, and add them to the graph one by one. The HST is built bottom up. At each
point we have a collection of HSTs, each corresponding to a connected component of the current graph. When an added edge merges two connected components, we merge the two corresponding HSTs into one by adding a new common root for the two HSTs, and labeling this root with the edge's weight times n − 1. This algorithm is only a slight variation on Kruskal's algorithm, and has the same running time.
We next estimate the approximation factor. Let x, y be two vertices of G. Denote by e the first edge that was added in the process above that put x and y in the same connected component C. Note that at that point in time e is the heaviest edge in C, so w(e) ≤ dG(x, y) ≤ (|C| − 1) w(e) ≤ (n − 1) w(e). Since dH(x, y) = (n − 1) w(e), we are done.
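A minimal Python sketch of this bottom-up merging follows; the minimum spanning tree is assumed to be given as a list of (weight, u, v) edges over vertices 0, …, n − 1, a simple union-find structure tracks the components, and the HST is returned as nested tuples (label, left, right) with integer leaves.

    def build_hst(n, mst_edges):
        """Build a binary HST from the MST edges of an n-vertex graph.
        mst_edges: list of (w, u, v) with u, v in range(n).  Returns the root;
        a leaf is an int, an internal node is a tuple (label, left, right)."""
        parent = list(range(n))              # union-find over the original vertices
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        comp_root = {i: i for i in range(n)}   # current HST root of each component
        for w, u, v in sorted(mst_edges):      # process edges in non-decreasing order
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            # Merge the two HSTs under a new root labeled (n - 1) * w.
            merged = ((n - 1) * w, comp_root[ru], comp_root[rv])
            parent[rv] = ru
            comp_root[ru] = merged
        return comp_root[find(0)]

    # Example: a path 0-1-2 with edge weights 1 and 3.
    root = build_hst(3, [(1, 0, 1), (3, 1, 2)])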
Corollary 12.1.4 For a set P of n points in IRd , one can construct, in O(n log n) time, a BHST H that (2n − 2)-approximates the
distances of points in P. That is, for any p, q ∈ P, we have dH (p, q)/(2n − 2) ≤ kpqk ≤ dH (p, q).
Proof: We remind the reader, that in IRd , one can compute a 2-spanner for P of size O(n), in O(n log n) time (see Theorem 3.2.1).
Let G be this spanner, and apply Lemma 12.1.3 to this spanner. Let H be the resulting metric. For any p, q ∈ P, we have
kpqk ≤ dH (p, q) ≤ (n − 1)dG (p, q) ≤ 2(n − 1) kpqk.
Corollary 12.1.4 is unique to IRd since for general metric spaces, no HST can be computed in subquadratic time, see Exer-
cise 12.5.1.
Corollary 12.1.5 For a set P of n points in a metric space M, one can compute a HST H that (n − 1)-approximates the metric dM .
Our objective is to show that (1 + ε)-ANN queries can be reduced to target ball queries among a near linear size set of balls. But let us first start with a “silly” result, to get some intuition about the problem.
Lemma 12.2.2 Let B = ∪_{i=−∞}^{∞} B(P, (1 + ε)^i), where B(P, r) = ∪_{p∈P} b(p, r). For q ∈ IRd, let b = B(q) be the target ball of q in B (i.e., the smallest ball of B containing q), and let p ∈ P be the center of b. Then, p is a (1 + ε)-ANN to q.
Proof: Let q̂ be the nearest neighbor to q in P, and let r = d_P(q). Let i be such that (1 + ε)^i < r ≤ (1 + ε)^{i+1}. Clearly, radius(b) > (1 + ε)^i, since no ball of B of radius ≤ (1 + ε)^i contains q. On the other hand, q ∈ b(q̂, (1 + ε)^{i+1}) ∈ B, and thus radius(b) ≤ (1 + ε)^{i+1}. It follows that ‖qp‖ ≤ radius(b) ≤ (1 + ε)^{i+1} ≤ (1 + ε) d_P(q), implying that p is a (1 + ε)-ANN to q.
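To make the lemma concrete, here is a brute-force Python sketch of such a target query; since the infinite family of radii cannot be enumerated, the sketch truncates it to exponents in a range [i_min, i_max] supplied by the caller, which is an assumption of this illustration only.

    import math

    def silly_target_ann(P, q, eps, i_min, i_max):
        """Return the center of the smallest ball of B = { b(p, (1+eps)^i) }
        that contains q, scanning the radii in increasing order (Lemma 12.2.2)."""
        for i in range(i_min, i_max + 1):
            radius = (1 + eps) ** i
            for p in P:
                if math.dist(q, p) <= radius:
                    return p      # first hit = smallest radius containing q
        return None               # q is not covered by the truncated range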
Remark 12.2.3 Here is an intuitive construction of a set of balls of polynomial size, such that target queries answer (1 + ε)-ANN correctly. Indeed, consider two points u, v ∈ P. As far as correctness is concerned, we care whether the ANN returns either u or v, for a query point q, only if d_P(q) ∈ [d_M(u, v)/4, 2 d_M(u, v)/ε] (for shorter distances, either u or v is the unique ANN, and for longer distances either one of them is an ANN for the set {u, v}).
Next, consider a range of distances [(1 + ε)^i, (1 + ε)^{i+1}] to be active if there are u, v ∈ P such that ε(1 + ε)^i ≤ d_M(u, v) ≤ 4(1 + ε)^{i+1}/ε. Clearly, the number of active intervals is O(n² ε^{−1} log(1/ε)) (one can prove a better bound). Generate a ball for each point of P, for each active range. Clearly, the resulting number of balls is polynomial, and can be used to resolve ANN queries.
Getting the number of balls to be near linear requires us to be more careful, and the details are provided below.
Definition 12.2.5 One can, in fact, resolve ANN queries on a range of distances [a, b] by building NNbr data structures with exponential jumps on this range. Formally, let N_i = NNbr(P, r_i), where r_i = (1 + ε)^i a, for i = 0, . . . , M − 1, where M = ⌈log_{1+ε}(b/a)⌉, and let N_M = NNbr(P, r_M), where r_M = b. We will denote this set of data-structures by Î(P, a, b, ε) = {N_0, . . . , N_M}. We refer to Î as an interval near-neighbor data structure.
Lemma 12.2.6 Given P as above, and parameters a ≤ b and ε > 0, we have: (i) Î(P, a, b, ε) is made out of O(ε^{−1} log(b/a)) NNbr data structures, and (ii) Î(P, a, b, ε) contains O(ε^{−1} n log(b/a)) balls overall.
Furthermore, one can decide which of the following options holds: (i) d_P(q) < a, (ii) d_P(q) > b, or (iii) return a number r and a point p ∈ P, such that r < d_P(q) ≤ ‖pq‖ ≤ (1 + ε)r (i.e., p is a (1 + ε)-ANN to q). This requires two NNbr queries if (i) or (ii) holds, and O(log(ε^{−1} log(b/a))) queries otherwise.
Proof: Given a query point q, we first check if dP (q) ≤ a, by querying N0 , and if so, the algorithm returns “dP (q) ≤ a”.
Otherwise, we check if dP (q) > b, by querying N M , and if so, the algorithm returns “dP (q) > b”.
Otherwise, let Xi = 1 if and only if dP (q) ≤ ri , for i = 0, . . . , M. We can determine the value of Xi by performing a query in
the data-structure Ni . Clearly, X0 , X1 , . . . , X M is a sequence of zeros, followed by a sequence of ones. As such, we can find the i,
such that Xi = 0 and Xi+1 = 1, by performing a binary search. This would require O(log M) queries. In this case, we have that
ri < dP (q) ≤ ri+1 ≤ (1 + ε)ri . Namely, the ball in NNbr(P, ri+1 ) covering q, corresponds to (1 + ε)-ANN to q.
To get the stated bounds, observe that by the Taylor expansion, ln(1 + x) = x − x²/2 + x³/3 − · · · ≥ x/2, for 0 < x ≤ 1. Thus,

M = ⌈ log_{1+ε}(b/a) ⌉ = O( ln(b/a) / ln(1 + ε) ) = O( log(b/a)/ε ),

since 1 ≥ ε > 0.
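The binary search in the proof can be sketched as follows in Python; here nnbr_query(i, q) is an assumed black box that, for the data-structure N_i, returns a point p of P with ‖pq‖ ≤ r_i if one exists, and None otherwise.

    def interval_ann_query(q, radii, nnbr_query):
        """Answer a query on the structure I(P, a, b, eps).
        radii:      the list [r_0, ..., r_M] with r_0 = a and r_M = b.
        nnbr_query: nnbr_query(i, q) returns a point within distance radii[i]
                    of q, or None if no such point of P exists.
        Returns ('below', p), ('above', None), or ('ann', p)."""
        M = len(radii) - 1
        p0 = nnbr_query(0, q)
        if p0 is not None:
            return ('below', p0)                 # d_P(q) <= a
        if nnbr_query(M, q) is None:
            return ('above', None)               # d_P(q) > b
        lo, hi = 0, M                            # invariant: X_lo = 0, X_hi = 1
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if nnbr_query(mid, q) is None:
                lo = mid
            else:
                hi = mid
        return ('ann', nnbr_query(hi, q))        # r_lo < d_P(q) <= r_hi <= (1+eps) r_lo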
Corollary 12.2.7 Let P be a set of n points in M, and let a < b be real numbers. For a query point q ∈ M such that d_P(q) ∈ [a, b], the target query over the set of balls of Î(P, a, b, ε) returns a ball centered at a (1 + ε)-ANN to q.
Lemma 12.2.6 implies that we can “cheaply” resolve (1 + ε)-ANN over intervals which are not too long.
Definition 12.2.8 For a set P of n points in a metric space M, let U_balls(P, r) = ∪_{p∈P} b(p, r) denote the union of balls of radius r around the points of P.
Lemma 12.2.9 Let Q be a set of m points in M, and let r > 0 be a real number such that U_balls(Q, r) is a connected set. Then: (i) Any two points p, q ∈ Q are in distance ≤ 2r(m − 1) from each other. (ii) For a query point q ∈ M such that d_Q(q) > 2mr/δ, any point of Q is a (1 + δ)-ANN of q.
Proof: (i) Since U_balls(Q, r) is a connected set, there is a path of length ≤ 2r(m − 1) between any two points x, y of Q. Indeed, consider the graph G which connects two vertices u, v ∈ Q if d_M(u, v) ≤ 2r. Since U_balls(Q, r) is a connected set, it follows that G is connected. As such, there is a path with at most m − 1 edges between x and y in G. This corresponds to a path of length ≤ 2r(m − 1) connecting x to y in M, and this path lies inside U_balls(Q, r).
(ii) For any p ∈ Q, we have

d_M(q, p) ≤ d_M(q, q̂) + d_M(q̂, p) ≤ d_M(q, q̂) + 2mr ≤ (1 + δ) d_M(q, q̂),

where q̂ is the nearest neighbor of q in Q, since d_M(q̂, p) ≤ 2r(m − 1) ≤ 2mr by (i), and 2mr < δ · d_Q(q) = δ · d_M(q, q̂).
Lemma 12.2.9 implies that for faraway query points, a cluster of points Q which are close together can be treated as a single point.
Proof: We are going to build a tree D, such that each node v has an interval near-neighbor data-structure Î_v associated with it. As we traverse down the tree, we will use those data-structures to decide into which child to continue the search.
Compute the minimum value r > 0 such that U_balls(P, r) is made out of ⌈n/2⌉ connected components. We set Î_root(T) = Î(P, r, R, ε/4), where R = 2cμnr/ε, μ is a global parameter, and c > 1 is an appropriate constant, both to be determined shortly. For each connected component C of U_balls(P, r), we recursively build a tree for C ∩ P (i.e., the points corresponding to C), and hang it on root(T). Furthermore, from each such connected component C, we pick one representative point q ∈ C ∩ P, and let Q be this set of points. We also build (recursively) a tree for Q, and hang it on root(T). We will refer to the child of root(T) corresponding to Q as the outer child of root(T).
Given a query point q ∈ M, use Î_root(T) = Î(P, r, R, ε/4) to determine if d_P(q) ≤ r. If so, we continue the search recursively in the relevant child built for the connected component of U_balls(P, r) containing q (we know which connected component it is, because Î_root(T) also returns a point of P in distance ≤ r from q). If d_P(q) ∈ [r, R], then we find a (1 + ε)-ANN directly from Î_root(T). Otherwise, d_P(q) > R, and we continue the search recursively in the outer child of root(T).
Observe that, in any case, we continue the search on a point set of size ≤ n/2 + 1. As such, after at most log_{3/2} n steps the search halts.
Correctness. If during the search the algorithm traverses from a node v down to one of the connected components of U_balls(P_v, r_v), then no error is introduced, where P_v is the point set used in constructing v, and r_v is the value of r used in its construction. If the query is resolved by Î_v, then a (1 + ε/4) error is introduced into the quality of approximation. If the algorithm continues the search in the outer child of v, then an error of 1 + δ_v is introduced into the answer, where δ_v = ε/(cμ), by Lemma 12.2.9 (ii). Thus, the overall quality of the ANN returned in the worst case is

t ≤ (1 + ε/4) ∏_{i=1}^{log_{3/2} n} (1 + ε/(cμ)) ≤ exp(ε/4) ∏_{i=1}^{log_{3/2} n} exp(ε/(cμ)),

since 1 + x ≤ e^x for x ≤ 1. Thus, setting μ = ⌈log_{3/2} n⌉ and c to be a sufficiently large constant, we have t ≤ exp( ε/4 + ∑_{i=1}^{log_{3/2} n} ε/(cμ) ) ≤ exp(ε/2) ≤ 1 + ε, since ε < 1/2. We used the fact that e^x ≤ 1 + 2x for x ≤ 1/2, as can be easily verified.
Number of queries. As the search algorithm proceeds down the tree D, at most two NNbr queries are performed at each node. At the last node of the traversal, the algorithm performs O(log(ε^{−1} log(n/ε))) = O(log(n/ε)) queries, by Lemma 12.2.6.
Number of balls. We need a new interpretation of the construction algorithm. In particular, let H be the HST constructed for P using the exact distances. It is easy to observe that a connected component of U_balls(P, r) is represented by a node v and its subtree in H. In particular, the recursive construction for each connected component is essentially calling the algorithm recursively on a subtree of H. Let V be the set of nodes of H which represent connected components of U_balls(P, r).
Similarly, the outer recursive call can be charged to the upper tree of H, having the nodes of V for leaves. Indeed, the outer set of points is the set rep(V) = { rep_v | v ∈ V }. Let L̂ be this collection of subtrees of H. Clearly, those subtrees are not disjoint in their vertices, but they are disjoint in their edges. The total number of edges is O(n), and |L̂| ≥ n/2.
Namely, we can interpret the algorithm (somewhat counterintuitively) as working on H. At every stage, we break the current HST into subtrees, and recursively continue the construction on those connected subtrees.
In particular, for a node v ∈ D, let n_v be the number of children of v. Clearly, |P_v| = O(n_v), and since we can charge each such child to the disconnection of edges of H from each other, we have that ∑_{v∈D} n_v = O(n).
At a node v ∈ D, the data-structure Î_v requires storing m_v = O(ε^{−1} |P_v| log(R_v/r_v)) = O(ε^{−1} n_v log(μ n_v/ε)) balls. We conclude that the overall number of balls stored in D is

∑_{v∈D} O( (n_v/ε) log(μ n_v/ε) ) = O( (n/ε) log( (n log n)/ε ) ) = O( (n/ε) log(n/ε) ).
A single target query. The claim that a target query on B answers a (1 + ε)-ANN query on P follows by an inductive proof over the execution of the algorithm. Indeed, assume the algorithm is at a node v, and let B_v be the set of balls stored in the subtree of v. We claim that if the algorithm continues the search into a child w, then all the balls of U = B_v \ B_w are irrelevant for the target query.
Indeed, if w is the outer child of v, then all the balls in U are too small, and none of them contains q. Otherwise, q ∈ U_balls(P_v, r_v). As such, all the balls of B_v that are bigger than the balls of U_balls(P_v, r_v) cannot be the smallest ball containing q, and as such they can be ignored. Furthermore, all the other balls stored in the other children of v are of radius ≤ r_v, and are not in the same connected component as U_balls(P_w, r_v) in U_balls(P_v, r_v); as such, none of them is relevant for the target query.
The only other case is when the algorithm stops the search at v. But in this case, we have r_v ≤ d_P(q) ≤ R_v, and then all the balls in the children of v are either too big (i.e., the balls stored at the outer child are of radius > R_v) or too small (i.e., the balls stored at the regular children are of radius < r_v). Thus, only the balls of Î(P_v, r_v, R_v, ε/4) are relevant, and there we know that the returned ball is centered at a (1 + ε/4)-ANN, by Corollary 12.2.7.
Thus, the point returned by the target query on B is identical to the one returned by running the search algorithm on D, and as such, by the above proof, the result is correct.
Lemma 12.2.11 Given a set P of n points in M and an HST H of P that t-approximates M, one can construct a data-structure D that answers (1 + ε)-ANN queries by performing O(log(n/ε)) NNbr queries. The total number of balls stored at D is O(nε^{−1} log(tn/ε)). The construction time is O(nε^{−1} log(tn/ε)).
Proof: We reimplement the algorithm of Theorem 12.2.10, by doing the decomposition directly on the HST H. Indeed, let U = { ∆_v | v ∈ H, and v is an internal node }. Let ℓ be the median value of U. Let V be the set of all the nodes v of H such that ∆_v ≤ ℓ and ∆_{p(v)} > ℓ, where p(v) denotes the parent of v. Next, build an Î = Î(P, r, R, ε/4), where r and R are set as in Eq. (12.1), and where c is a large enough constant. As in the algorithm of Theorem 12.2.10, the set V breaks H into ⌈n/2⌉ + 1 subtrees, and we continue the construction on each such connected component. In particular, for the new root node we create, set r_root(D) = r, R_root(D) = R, and ℓ_root(D) = ℓ.
Observe that, for a query point q ∈ M, if d_P(q) ≤ r, then Î would return a ball, which in turn corresponds to a point stored in one of the subtrees rooted at a node v ∈ V. Let C be the connected component of U_balls(P, r) that contains q, and let Q = P ∩ C. It is easy to verify that Q ⊆ P_v. Indeed, since H t-approximates d_M, and Q is a connected component of U_balls(P, r), it follows that Q must be in the same connected component of H when considering distances ≤ t·r < ℓ. But such a connected component of H is nothing more than a node v ∈ H and the points of P stored in the subtree of v in H. However, such a node v is either in V, or one of its ancestors is in V.
For a node v ∈ D, the number of balls of Î_v is O(ε^{−1} n_v log((log n) n_v t/ε)). Thus, the overall number of balls in the data-structure is as claimed. As for the construction time, it is dominated by the size of the Î data-structures, and as such the same bound holds.
Using Corollary 12.1.4 with Lemma 12.2.11 we get the following.
Theorem 12.2.12 Let P be a set of n points in IRd, and ε > 0 a parameter. One can compute a set of O(nε^{−1} log(n/ε)) balls (the same bound holds for the running time), such that a (1 + ε)-ANN query can be resolved by a target ball query on this set of balls.
Alternatively, one can construct a data-structure where (1 + ε)-ANN queries can be resolved by O(log(n/ε)) NNbr queries.
Definition 12.3.1 For a ball b = b(p, r), a set b≈ is an (1 + ε)-approximation to b, if b ⊆ b≈ ⊆ b(p, (1 + ε)r).
For a set of balls B, the set B≈ is an (1 + ε)-approximation to B, if for any ball b ∈ B there is a corresponding (1 + ε)-
approximation b≈ ∈ B≈ . For a set b≈ ∈ B≈ , let b ∈ B denote the ball corresponding to b≈ , rb be the radius of b, and let pb ∈ P
denote the center of b.
For a query point q ∈ M, the target set of B≈ in q, is the set b≈ of B≈ that contains q and has the smallest radius rb .
Lemma 12.3.2 Let I≈ = I≈(P, r, R, ε/16) be a (1 + ε/16)-approximation to Î(P, r, R, ε/16). For a query point q ∈ M, if I≈ returns a ball centered at p ∈ P of radius α, with α ∈ [r, R], then p is a (1 + ε/4)-ANN to q.
Proof: The data-structure I≈ returns p as an ANN to q if and only if there are two consecutive indices i, i + 1 such that q is inside the union of the approximate balls of N_{i+1} but not inside the approximate balls of N_i. Thus, r(1 + ε/16)^i ≤ d_P(q) ≤ d(p, q) ≤ r(1 + ε/16)^{i+1}(1 + ε/16) ≤ (1 + ε/4) d_P(q). Thus, p is indeed a (1 + ε/4)-ANN.
Lemma 12.3.3 Let P be a set of n points in a metric space M, and let B be a set of balls with centers at P, computed by the
algorithm of Lemma 12.2.11, such that one can answer (1 + ε/16)-ANN queries on P, by performing a single target query in B.
Let B≈ be a (1 + ε/16)-approximation to B. A target query on B≈ , for a query point q, returns a (1 + ε)-ANN to q in P.
Proof: Let D be the tree computed by Lemma 12.2.11. We prove the correctness of the algorithm by an inductive proof over
the height of D, similar in nature to the proofs of Lemma 12.2.11 and Theorem 12.2.10.
Indeed, for a node v ∈ D, let I≈_v = I≈(P_v, r_v, R_v, ε/16) be the set of (1 + ε/16)-approximate balls of B≈ that corresponds to the set of balls stored in Î_v = Î(P_v, r_v, R_v, ε/16). If the algorithm stops at v, we know by Lemma 12.3.2 that we returned a (1 + ε/4)-ANN to q in P_v.
Otherwise, if we continued the search into the outer child of v using Î_v, then we would also continue the search into this node when using I≈_v. As such, by induction, the result is correct.
Otherwise, we continue the search into a node w, where q ∈ U_balls(P_w, (1 + ε/16) r_v). We observe that, because of the factor 2 slackness of Eq. (12.1), we are guaranteed to continue the search in the right connected component of U_balls(P_v, ℓ_v).
Thus, the quality of approximation argument of Lemma 12.2.11 and Theorem 12.2.10 still holds, and requires only an easy adaptation to this case (we omit the straightforward but tedious details). We conclude that the returned result is correct.
Open problems. The argument showing that one can use an approximate near-neighbor data-structure instead of an exact one (i.e., Lemma 12.3.3) is tedious and far from being elegant; see Exercise 12.5.2. A natural question for further research is to try and give a simple concrete condition on the set of balls, such that using an approximate near-neighbor data-structure still gives the correct result. Currently, it seems that what we need is some kind of separability property. However, it would be nice to give a simpler direct condition.
As mentioned above, Sabharwal et al. [SSS02] showed a reduction from ANN to a linear number of balls (ignoring the dependency on ε). However, it seems like a simpler construction should work.
12.5 Exercises
Exercise 12.5.1 (Lower bound on computing HST) [10 Points]
Show, by an adversarial argument, that for any t > 1 the following holds: any algorithm computing an HST H for n points in a metric space M, such that H t-approximates d_M, must in the worst case inspect all n(n − 1)/2 distances in the metric. Thus, computing an HST requires quadratic time in the worst case.
Conjecture 12.5.3 Let P be a set of n points in the plane, and let B be a set of balls such that one can answer (1 + ε/c)-ANN queries on P by performing a single target query in B. Now, let B≈ be a (1 + ε/c)-approximation to B. Then a target query on B≈ answers a (1 + ε)-ANN query on P, for c a large enough constant.
Give a counterexample to the above conjecture, showing that it is incorrect.
Chapter 13
“She had given him a smile, first because that was more or less what she was there for, and then because she had never seen him before and she had a prejudice in favor of people she did not know.”
– The Roots of Heaven, Romain Gary
13.1 Introduction
A Voronoi diagram of a point set P ⊆ IRd is a partition of space into regions, such that the cell of p ∈ P is V(p, P) = { x | ‖xp‖ ≤ ‖xp′‖ for all p′ ∈ P }. Voronoi diagrams are a powerful tool and have numerous applications [Aur91].
One problem with Voronoi diagrams is that their descriptive complexity is O(n^⌈d/2⌉) in IRd, in the worst case. See Figure 13.1 for an example in three dimensions. It is a natural question to ask whether one can reduce the complexity to linear (or near linear) by allowing some approximation.
Definition 13.1.1 (Approximate Voronoi Diagram.) Given a set P of n points in IRd and a parameter ε > 0, a (1 + ε)-approximate Voronoi diagram of P is a partition V of space into regions, such that for any region ϕ ∈ V there is an associated point rep_ϕ ∈ P, such that for any x ∈ ϕ, the point rep_ϕ is a (1 + ε)-ANN for x in P. We will refer to V as a (1 + ε)-AVD.
Figure 13.1: (a) The point set in 3D inducing a Voronoi diagram of quadratic complexity. (b) Some cells in this Voronoi diagram. Note that the cells are thin and flat, and every cell from the lower part touches the cells on the upper part. (c) The contact surface between the two parts of the Voronoi diagram has quadratic complexity, and thus the Voronoi diagram itself has quadratic complexity.
Theorem 13.2.1 Let P be a set of n points in IRd. One can build a compressed quadtree T̂, in O(nε^{−d} log²(n/ε)) time, of size O(nε^{−d} log(n/ε)), such that a (1 + ε)-ANN query on P can be answered by a single point-location query in T̂. Such a point-location query takes O(log(n/ε)) time.
Proof: The construction is described above. We only need to prove the bounds on the running time. It takes O(nε^{−1} log(n/ε)) time to compute B. For every such ball, we generate O(1/ε^d) canonical grid cells that cover it. Let C be the resulting set of grid cells (after filtering out multiple instances of the same grid cell). Naively, the size of C is bounded by O(|B|/ε^d). However, it is easy to verify that B has a large number of balls of similar size centered at the same point (because the set Î has a lot of such balls). In particular, for a set of balls Î({p}, r, 2r, ε/16), only O(1/ε^d) canonical grid cells are required to approximate it. Thus, we can bound |C| by N = O(|B|/ε^{d−1}) = O(nε^{−d} log(n/ε)). We can also compute C in this time, by careful implementation (we omit the straightforward and tedious details).
Constructing T̂ for C takes O(N log N) = O(nε^{−d} log²(n/ε)) time (Lemma 2.2.5). The resulting compressed quadtree is of size O(N). Point-location queries in T̂ take O(log N) = O(log(n/ε)) time. Given a query point and the leaf v such that q ∈ □_v, we need to find the first ancestor above v that has a point associated with it. This can be done in constant time, by preprocessing T̂ and propagating this information down the compressed quadtree.
Definition 13.3.1 (Exponential Grid.) For a point p ∈ IRd , and parameters r, R and ε > 0, let GE (p, r, R, ε) denote an exponential
grid centered at p.
Figure 13.2: Exponential grid.
Let b_i = b(p, r_i), for i = 0, ..., ⌈lg(R/r)⌉, where r_i = r·2^i. Next, let G′_i be the set of cells of the canonical grid G_{α_i} that intersect b_i, where α_i = 2^{⌊lg(εr_i/(16d))⌋}. Clearly, |G′_i| = O(1/ε^d). We remove from G′_i all the canonical cells completely covered by cells of G′_{i−1}. Similarly, for the cells of G′_i that are partially covered by cells of G′_{i−1}, we replace them by the cells covering them in G_{α_{i−1}}. Let G_i be the resulting set of canonical grid cells, and let G_E(p, r, R, ε) = ∪_i G_i. We have |G_E(p, r, R, ε)| = O(ε^{-d} log(R/r)). Furthermore, it can be computed in linear time in its size; see Figure 13.2.
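To make the construction above concrete, here is a small Python sketch of generating the canonical cells of an exponential grid around a point p. It is our own simplification: the function names (exp_grid, canonical_cells) are ours, the filtering of cells covered by the previous level is omitted, and duplicates are only removed within a level, so the sketch illustrates the O(ε^{-d} log(R/r)) cell count rather than the exact construction.

```python
import math
import numpy as np

def canonical_cells(center, radius, alpha):
    """Canonical grid cells of sidelength alpha (aligned to multiples of alpha)
    whose bounding box intersects the ball b(center, radius)."""
    lo = np.floor((center - radius) / alpha).astype(int)
    hi = np.floor((center + radius) / alpha).astype(int)
    ranges = [range(int(l), int(h) + 1) for l, h in zip(lo, hi)]
    grid = np.meshgrid(*ranges, indexing="ij")
    return {(alpha,) + idx for idx in zip(*(g.ravel() for g in grid))}

def exp_grid(p, r, R, eps, d):
    """Sketch of G_E(p, r, R, eps): grids of exponentially growing sidelength
    covering the balls b(p, r_i), where r_i = r * 2**i."""
    cells = set()
    for i in range(int(math.ceil(math.log2(R / r))) + 1):
        r_i = r * 2 ** i
        alpha_i = 2.0 ** math.floor(math.log2(eps * r_i / (16 * d)))
        cells |= canonical_cells(np.asarray(p, dtype=float), r_i, alpha_i)
    return cells

# Tiny usage example (d = 2): the cell count behaves like eps^{-d} log(R/r).
print(len(exp_grid((0.5, 0.5), r=0.01, R=0.16, eps=0.5, d=2)))
```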
Let P be a set of n points, and let 0 < ε < 1/2. As before, we assume that P ⊆ [0.5 − ε/d, 0.5 + ε/d]d .
Compute a (1/(8d))^{-1}-WSPD W of P. Note that |W| = O(n). For every pair X = {u, v} ∈ W, let ℓ_uv = ‖uv‖, and consider the set of canonical cells
W(u, v) = G_E(rep_u, ℓ_uv/4, 4ℓ_uv/ε, ε) ∪ G_E(rep_v, ℓ_uv/4, 4ℓ_uv/ε, ε).
For a query point q, if d_P(q) ∈ [ℓ_uv/4, ℓ_uv/ε] and the nearest neighbor of q is in P_u ∪ P_v, then the cell □ ∈ W(u, v) containing q is of the right size; it complies with Eq. (13.1), and as such any (1 + ε/4)-ANN to any point of □ is a (1 + ε)-ANN to q.
Thus, let W = ∪_{{u,v}∈W} W(u, v) ∪ [0, 1]^d. Let T̂ be a compressed quadtree constructed so that it contains all the canonical cells of W as nodes (we remind the reader that this can be done in O(|W| log |W|) time; see Lemma 2.2.5). Next, let U be the space decomposition induced by the leaves of T̂.
For each cell □ ∈ U, take an arbitrary point rep_□ inside it, and compute a (1 + ε/4)-ANN to rep_□. Let rep̃_□ ∈ P be this ANN, and store it together with □.
Proof: Let q be an arbitrary query point, and let □ ∈ U be the cell containing q. Let rep_□ ∈ □ be its representative, and let rep̃_□ ∈ P be the (1 + ε/4)-ANN to rep_□ stored at □. Also, let q̂ be the nearest neighbor of q in P. If rep̃_□ = q̂ then we are done. Otherwise, consider the pair {u, v} ∈ W such that rep̃_□ ∈ P_u and q̂ ∈ P_v. Finally, let ℓ = ‖rep̃_□ q̂‖.
If ‖q q̂‖ > ℓ/ε then rep̃_□ is a (1 + ε)-ANN, since
‖q rep̃_□‖ ≤ ‖q q̂‖ + ‖q̂ rep̃_□‖ ≤ (1 + ε)‖q q̂‖.
If ‖q q̂‖ < ℓ/4, then by the construction of W(u, v), we have that diam(□) ≤ (ε/16)‖rep_u rep_v‖ ≤ εℓ/8. This holds by the construction of the WSPD, where rep_u, rep_v are the representatives, from the WSPD construction, of P_u and P_v, respectively. See Figure 13.3. But then,
‖rep_□ q̂‖ ≤ ‖q q̂‖ + diam(□) ≤ ℓ/4 + εℓ/8 ≤ (5/16)ℓ,
for ε ≤ 1/2. On the other hand,
‖rep_□ rep̃_□‖ ≥ ‖rep_u rep_v‖ − ‖rep_v q̂‖ − ‖q̂ q‖ − diam(□) − ‖rep_u rep̃_□‖ ≥ ℓ − ℓ/8 − ℓ/4 − εℓ/8 − ℓ/8 ≥ (7/16)ℓ.
But then, (1 + ε/4)‖rep_□ q̂‖ ≤ (9/8)(5/16)ℓ < (7/16)ℓ ≤ ‖rep_□ rep̃_□‖. A contradiction, since rep̃_□ is supposed to be a (1 + ε/4)-ANN to rep_□ in P. See Figure 13.3.
Figure 13.3.
If ‖q q̂‖ ∈ [ℓ/4, ℓ/ε], then by the construction of W(u, v), we have that diam(□) ≤ (ε/4) d_{P_u ∪ P_v}(q). We have d_P(rep_□) ≤ (1 + ε/4) d_P(q) and
‖q rep̃_□‖ ≤ ‖q rep_□‖ + ‖rep_□ rep̃_□‖ ≤ diam(□) + (1 + ε/4) d_P(rep_□)
≤ (ε/4) d_P(q) + (1 + ε/4)(1 + ε/4) d_P(q) ≤ (1 + ε) d_P(q),
as required.
Theorem 13.3.3 Given a set P of n points in IR^d, one can compute, in O(nε^{-d} log(1/ε)(ε^{-d} + log(n/ε))) time, a (1 + ε)-AVD of P. The AVD is of complexity O(nε^{-d} log(1/ε)).
Proof: The total number of cubes in W is O(nε^{-d} log(1/ε)), and W can also be computed in this time, as described above. For each node of W we need to perform a (1 + ε/4)-ANN query on P. After O(n log n) preprocessing, such queries can be answered in O(log n + 1/ε^d) time (Theorem 11.3.3). Thus, we can answer those queries, and build the overall data-structure, in the time stated in the theorem.
Chapter 14
In this chapter, we will prove that given a set P of n points in IR^d, one can reduce the dimension of the points to k = O(ε^{-2} log n) so that distances are preserved up to a factor of 1 ± ε. Surprisingly, this reduction is done by randomly picking a subspace of k dimensions and projecting the points into this random subspace. One way of thinking about this result is that we are “compressing” the input of size nd (i.e., n points with d coordinates) into size O(nε^{-2} log n), while (approximately) preserving distances.
Theorem 14.1.2 (Brunn-Minkowski inequality) Let A and B be two non-empty compact sets in IRn . Then
Vol(A + B)1/n ≥ Vol(A)1/n + Vol(B)1/n .
Definition 14.1.3 A set A ⊆ IR^n is a brick set if it is the union of finitely many (closed) axis-parallel boxes with disjoint interiors.
It is intuitively clear, by limit arguments, that proving Theorem 14.1.2 for brick sets implies it for the general case.
Lemma 14.1.4 (Brunn-Minkowski inequality for Brick Sets) Let A and B be two non-empty brick sets in IRn . Then
Vol(A + B)1/n ≥ Vol(A)1/n + Vol(B)1/n
Proof: By induction on the number k of bricks in A and B. If k = 2 then A and B are just bricks, with dimensions a_1, ..., a_n and b_1, ..., b_n, respectively. In this case, the dimensions of A + B are a_1 + b_1, ..., a_n + b_n, as can be easily verified. Thus, we need to prove that (∏_{i=1}^n a_i)^{1/n} + (∏_{i=1}^n b_i)^{1/n} ≤ (∏_{i=1}^n (a_i + b_i))^{1/n}. Dividing the left side by the right side, we have
∏_{i=1}^n ( a_i/(a_i + b_i) )^{1/n} + ∏_{i=1}^n ( b_i/(a_i + b_i) )^{1/n} ≤ (1/n) Σ_{i=1}^n a_i/(a_i + b_i) + (1/n) Σ_{i=1}^n b_i/(a_i + b_i) = 1,
by the generalized arithmetic-geometric mean inequality¯, and the claim follows for this case.
¯ Here is a proof of this generalized form: let x_1, ..., x_n be n positive real numbers, and consider the quantity R = x_1 x_2 ⋯ x_n. If we fix the sum of the n numbers to be equal to α, then R is maximized when all the x_i are equal. Thus, (x_1 x_2 ⋯ x_n)^{1/n} ≤ ((α/n)^n)^{1/n} = α/n = (x_1 + ⋯ + x_n)/n.
Now let k > 2 and suppose that the Brunn-Minkowski inequality holds for any pair of brick sets with fewer than k bricks
(together). Let A, B be a pair of brick sets having together k bricks, where A has at least two (disjoint) bricks. This implies that
there is an axis parallel hyperplane h that separates between the interior of one brick of A and the interior of another brick of A (the
hyperplane h might intersect other bricks of A). Assume that h is the hyperplane x1 = 0 (this can be achieved by translation and
renaming of coordinates).
Let A^+ = A ∩ h^+ and A^− = A ∩ h^−, where h^+ and h^− are the two open halfspaces induced by h. Let A^+ and A^− be the closures of A^+ and A^−, respectively. Clearly, A^+ and A^− are both brick sets with (at least) one fewer brick than A.
Next, observe that the claim is translation invariant, and as such, let us translate B so that its volume is split by h in the same
ratio A’s volume is being split. Denote the two parts of B by B+ and B− , respectively. Let ρ = Vol(A+ )/ Vol(A) = Vol(B+ )/ Vol(B)
(if Vol(A) = 0 or Vol(B) = 0 the claim trivially holds).
Observe, that A+ + B+ ⊆ A + B, and it lies on one side of h, and similarly A− + B− ⊆ A + B and it lies on the other side of h.
Thus, by induction, we have
Vol(A + B) ≥ Vol(A^+ + B^+) + Vol(A^− + B^−)
≥ (Vol(A^+)^{1/n} + Vol(B^+)^{1/n})^n + (Vol(A^−)^{1/n} + Vol(B^−)^{1/n})^n
= (ρ^{1/n} Vol(A)^{1/n} + ρ^{1/n} Vol(B)^{1/n})^n + ((1 − ρ)^{1/n} Vol(A)^{1/n} + (1 − ρ)^{1/n} Vol(B)^{1/n})^n
= (ρ + (1 − ρ)) (Vol(A)^{1/n} + Vol(B)^{1/n})^n
= (Vol(A)^{1/n} + Vol(B)^{1/n})^n.
Theorem 14.1.5 (Brunn-Minkowski for slice volumes.) Let P be a convex set in IR^{n+1}, and let A = P ∩ (x_1 = a), B = P ∩ (x_1 = b) and C = P ∩ (x_1 = c) be three slices of P, for a < b < c. We have Vol(B) ≥ min(Vol(A), Vol(C)).
In fact, consider the function
v(t) = (Vol(P ∩ (x1 = t)))1/n ,
and let I = [tmin , tmax ] be the interval where the hyperplane x1 = t intersects P. Then, v(t) is concave in I.
Proof: If a or c are outside I, then Vol(A) = 0 or Vol(C) = 0, respectively, and then the claim trivially holds.
Otherwise, let α = (b − a)/(c − a). We have that b = (1 − α) · a + α · c, and by the convexity of P, we have (1 − α)A + αC ⊆ B.
Thus, by Theorem 14.1.2 we have
v(b) = Vol(B)1/n ≥ Vol((1 − α)A + αC)1/n ≥ Vol((1 − α)A)1/n + Vol(αC)1/n
= (1 − α) · Vol(A)1/n + α · Vol(C)1/n
≥ (1 − α)v(a) + αv(c).
Namely, v(·) is concave on I, and in particular v(b) ≥ min(v(a), v(c)), which in turn implies that Vol(B) = v(b)^n ≥ min(Vol(A), Vol(C)), as claimed.
Corollary 14.1.6 For A and B compact sets in IR^n, we have Vol((A + B)/2) ≥ √(Vol(A) Vol(B)).
Proof: Vol((A + B)/2)^{1/n} = Vol(A/2 + B/2)^{1/n} ≥ Vol(A/2)^{1/n} + Vol(B/2)^{1/n} = (Vol(A)^{1/n} + Vol(B)^{1/n})/2 ≥ √(Vol(A)^{1/n} Vol(B)^{1/n}), by Theorem 14.1.2, and since (a + b)/2 ≥ √(ab) for any a, b ≥ 0. The claim now follows by raising this inequality to the power n.
14.1.1 The Isoperimetric Inequality
The following is not used anywhere else and is provided only because of its mathematical elegance; the reader may safely skip this section.
The isoperimetric inequality states that among all convex bodies of a fixed surface area, the ball has the largest volume (in
particular, the unit circle is the largest area planar region with perimeter 2π). This problem can be traced back to antiquity, in
particular Zenodorus (200–140 BC) wrote a monograph (which was lost) that seemed to have proved the claim in the plane for
some special cases. The first formal proof for the planar case was done by Steiner in 1841. Interestingly, the more general claim is
an easy consequence of the Brunn-Minkowski inequality.
Let K be a convex body in IRn and b = bn be the n dimensional ball of radius one centered at the origin. Let S(X) denote the
surface area of a compact set X ⊆ IRn . The isoperimetric inequality states that
(Vol(K)/Vol(b))^{1/n} ≤ (S(K)/S(b))^{1/(n−1)}.     (14.1)
Namely, the left side is the radius of a ball having the same volume as K, and the right side is
the radius of a sphere having the same surface area as K. In particular, if we scale K so that its
surface area is the same as b, then the above inequality implies that Vol(K) ≤ Vol(b).
To prove Eq. (14.1), observe that Vol(b) = S(b) /n . Also, observe that K + ε b is the body
K together with a small “atmosphere” around it of thickness ε. In particular, the volume of this
“atmosphere” is (roughly) ε S(K) (in fact, Minkowski defined the surface area of a convex body
to be the limit stated next). Formally, we have
S(K) = lim_{ε→0^+} (Vol(K + εb) − Vol(K))/ε ≥ lim_{ε→0^+} ((Vol(K)^{1/n} + Vol(εb)^{1/n})^n − Vol(K))/ε,
The volume of a ball and the surface area of a hypersphere. In fact, let Vol(rb^n) denote the volume of the ball of radius r in IR^n, and let Area(rS^{(n−1)}) denote the surface area of its boundary (i.e., the surface area of the sphere rS^{(n−1)}). Slicing the unit ball by hyperplanes orthogonal to the x_n-axis, it is known that
Vol(b^n) = Vol(b^{n−1}) ∫_{−1}^{1} (1 − x_n²)^{(n−1)/2} dx_n.
Now, the integral on the right side tends to zero as n increases. In fact, for n very large, the term (1 − x_n²)^{(n−1)/2} is very close to 0 everywhere except for a small interval around 0. This implies that the main contribution to the volume of the ball happens when we consider slices of the ball by hyperplanes of the form x_n = δ, where δ is small.
If one has to visualize how such a ball in high dimensions looks, it might be best to think about it as a star-like creature: it has very little mass close to the tips of any set of orthogonal directions we pick, and most of its mass somehow lies close to its center.®
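The following short Python experiment (a sketch we added; the sampling scheme, a Gaussian direction scaled by a U^{1/n} radius, is the standard way to draw uniform points from b^n) illustrates this picture numerically: even a thin slab around the hyperplane x_n = 0 captures most of the volume of the ball once n is large.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_in_ball(n, samples):
    """Uniform points in the unit ball b^n: uniform direction times U^{1/n} radius."""
    g = rng.standard_normal((samples, n))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(samples) ** (1.0 / n)
    return directions * radii[:, None]

for n in (2, 10, 100, 1000):
    pts = uniform_in_ball(n, 20000)
    # Fraction of the volume within distance 2/sqrt(n) of the hyperplane x_n = 0.
    frac = np.mean(np.abs(pts[:, -1]) <= 2.0 / np.sqrt(n))
    print(n, round(float(frac), 3))
```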
Proof: We will prove a slightly weaker bound, with −nt²/4 in the exponent. Let
Â = { αx | x ∈ A, α ∈ [0, 1] } ⊆ b^n,
where b^n is the unit ball in IR^n. We have that Pr[A] = μ(Â), where μ(Â) = Vol(Â)/Vol(b^n).¯
Let B = S^{(n−1)} \ A_t. We have that ‖a − b‖ ≥ t for all a ∈ A and b ∈ B.
Lemma 14.2.2 For any â ∈ Â and b̂ ∈ B̂, we have ‖(â + b̂)/2‖ ≤ 1 − t²/8.
Proof: Let â = αa and b̂ = βb, where a ∈ A and b ∈ B. We have
‖(a + b)/2‖ = √(1² − ‖(a − b)/2‖²) ≤ √(1 − t²/4) ≤ 1 − t²/8,     (14.2)
since ‖a − b‖ ≥ t. As for â and b̂, assume (without loss of generality) that α ≤ β; the quantity ‖(â + b̂)/2‖ is largest when β = 1. Then
τ = ‖(â + b̂)/2‖ ≤ ‖(αa + b)/2‖ = ‖α·(a + b)/2 + (1 − α)·(b/2)‖ ≤ α‖(a + b)/2‖ + (1 − α)/2 ≤ α(1 − t²/8) + (1 − α)·(1/2),
by Eq. (14.2) and since ‖b‖ = 1. Now, τ is a convex combination of the two numbers 1/2 and 1 − t²/8. In particular, we conclude that τ ≤ max(1/2, 1 − t²/8) ≤ 1 − t²/8, since t ≤ 2.
®
In short, it looks like a Boojum [Car76].
¯ This is one of these “trivial” claims that might give the reader pause, so here is a formal proof. Pick a random point p uniformly inside the ball b^n. Let ψ be the probability that p ∈ Â. Clearly, Vol(Â) = ψ Vol(b^n). So, consider the normalized point q = p/‖p‖. Clearly, p ∈ Â if and only if q ∈ A, by the definition of Â. Thus, μ(Â) = Vol(Â)/Vol(b^n) = ψ = Pr[p ∈ Â] = Pr[q ∈ A] = Pr[A], since q has a uniform distribution on the hypersphere, by the symmetry of b^n.
By Lemma 14.2.2, the set (Â + B̂)/2 is contained in a ball of radius 1 − t²/8 around the origin. Applying the Brunn-Minkowski inequality in the form of Corollary 14.1.6, we have
(1 − t²/8)^n ≥ μ((Â + B̂)/2) ≥ √(μ(Â) μ(B̂)) = √(Pr[A] Pr[B]) ≥ √(Pr[B]/2).
Lemma 14.3.1 We have Pr[f < med(f)] ≤ 1/2 and Pr[f > med(f)] ≤ 1/2.
Proof: Since ∪_{k≥1} (−∞, med(f) − 1/k] = (−∞, med(f)), we have
Pr[f < med(f)] = sup_{k≥1} Pr[f ≤ med(f) − 1/k] ≤ 1/2.
Definition 14.3.2 (c-Lipschitz) A function f : A → B is c-Lipschitz if, for any x, y ∈ A, we have k f (x) − f (y)k ≤ c kx − yk.
Theorem 14.3.3 (Lévy’s Lemma.) Let f : S(n−1) → IR be 1-Lipschitz. Then for all t ∈ [0, 1],
Pr[f > med(f) + t] ≤ 2 exp(−t²n/2)   and   Pr[f < med(f) − t] ≤ 2 exp(−t²n/2).
Proof: We prove only the first inequality; the second follows by symmetry. Let
A = { x ∈ S^{(n−1)} | f(x) ≤ med(f) }.

Lemma 14.4.1 For a unit vector x ∈ S^{(n−1)}, let f(x) = ‖(x_1, ..., x_k)‖ be the length of the projection of x into the subspace formed by the first k coordinates. Let x be a vector randomly chosen with uniform distribution from S^{(n−1)}. Then f(x) is sharply concentrated. Namely, there exists m = m(n, k) such that
Pr[f(x) ≥ m + t] ≤ 2 exp(−t²n/2)   and   Pr[f(x) ≤ m − t] ≤ 2 exp(−t²n/2).
Furthermore, for k ≥ 10 ln n, we have m ≥ (1/2)√(k/n).
Proof: The orthogonal projection p : ℓ_2^n → ℓ_2^k given by p(x_1, ..., x_n) = (x_1, ..., x_k) is 1-Lipschitz (since projections can only shrink distances; see Exercise 14.7.4). As such, f(x) = ‖p(x)‖ is 1-Lipschitz, since for any x, y we have
|f(x) − f(y)| = |‖p(x)‖ − ‖p(y)‖| ≤ ‖p(x) − p(y)‖ ≤ ‖x − y‖,
by the triangle inequality and since p is 1-Lipschitz. Theorem 14.3.3 (i.e., Lévy’s lemma) gives the required tail estimate with m = med(f).
Thus, we only need to prove the lower bound on m. For a random x = (x_1, ..., x_n) ∈ S^{(n−1)}, we have E[‖x‖²] = 1. By linearity of expectation, and symmetry, we have 1 = E[‖x‖²] = E[Σ_{i=1}^n x_i²] = Σ_{i=1}^n E[x_i²] = n E[x_j²], for any 1 ≤ j ≤ n. Thus, E[x_j²] = 1/n, for j = 1, ..., n. Thus, E[(f(x))²] = k/n. We next use the fact that f is concentrated to show that f² is also relatively concentrated. For any t ≥ 0, we have
k/n = E[f²] ≤ Pr[f ≤ m + t]·(m + t)² + Pr[f ≥ m + t]·1 ≤ 1·(m + t)² + 2 exp(−t²n/2),
since f(x) ≤ 1 for any x ∈ S^{(n−1)}. Let t = √(k/(5n)). Since k ≥ 10 ln n, we have that 2 exp(−t²n/2) ≤ 2/n. We get that
k/n ≤ (m + √(k/(5n)))² + 2/n.
This implies that √((k − 2)/n) ≤ m + √(k/(5n)), which in turn implies that m ≥ √((k − 2)/n) − √(k/(5n)) ≥ (1/2)√(k/n).
At this point, we would like to flip Lemma 14.4.1 around, and instead of randomly picking a point and projecting it down to
the first k-dimensional space, we would like x to be fixed, and randomly pick the k-dimensional subspace. However, we need to
pick this k-dimensional space carefully, so that if we rotate this random subspace, by a transformation T , so that it occupies the first
k dimensions, then the point T (x) is uniformly distributed on the hypersphere.
To this end, we would like to pick a random rotation of IR^n; that is, an orthonormal matrix with determinant 1. We can generate such a matrix by randomly picking a vector e_1 ∈ S^{(n−1)}. Next, we set e_1 to be the first column of our rotation matrix, and generate the other n − 1 columns by generating, recursively, n − 1 orthonormal vectors in the space orthogonal to e_1.
Generating a random vector from the unit hypersphere, and a random rotation. At this point, the reader might
wonder how do we pick a point uniformly from the unit hypersphere. The idea is to pick a point from the multi-
dimensional normal distribution N d (0, 1), and normalizing it to have length 1. Since the multi-dimensional normal
distribution has the density function
(2π)^{−n/2} exp(−(x_1² + x_2² + ⋯ + x_n²)/2),
which is symmetric (i.e., all the points at distance r from the origin have the same density), it follows that this indeed generates a point uniformly at random on S^{(n−1)}.
Generating a vector with the multi-dimensional normal distribution amounts to picking each coordinate independently according to the (one-dimensional) normal distribution. Given a source of random numbers with the uniform distribution, this can be done using O(1) computations, via the Box-Muller transformation [BM58].
Since projecting the n-dimensional normal distribution down to a lower dimensional space yields a normal distribution, it follows that generating a random projection amounts to randomly picking n vectors v_1, ..., v_n according to the multi-dimensional normal distribution. Then, we orthonormalize them using Gram-Schmidt, where v̂_1 = v_1/‖v_1‖, and v̂_i is the normalized vector of v_i − w_i, where w_i is the projection of v_i to the space spanned by v_1, ..., v_{i−1}.
Taking those vectors as the columns of a matrix generates a matrix A with determinant either 1 or −1. We multiply one of the vectors by −1 if the determinant is −1. The resulting matrix is a random rotation matrix.
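Here is a small Python sketch (our own illustration) of the two procedures just described: a uniform point on S^{(n−1)} via a normalized Gaussian vector, and a random rotation obtained by Gram-Schmidt orthonormalization of Gaussian vectors, with a sign flip if the determinant comes out −1.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_unit_vector(n):
    """Uniform point on S^{(n-1)}: normalize an N^n(0,1) sample."""
    v = rng.standard_normal(n)
    return v / np.linalg.norm(v)

def random_rotation(n):
    """Random rotation matrix: Gram-Schmidt on n Gaussian vectors."""
    cols = []
    for _ in range(n):
        v = rng.standard_normal(n)
        for u in cols:                 # subtract projections onto previous columns
            v = v - np.dot(v, u) * u
        cols.append(v / np.linalg.norm(v))
    A = np.column_stack(cols)
    if np.linalg.det(A) < 0:           # fix orientation so that det(A) = +1
        A[:, 0] = -A[:, 0]
    return A

R = random_rotation(5)
print(np.allclose(R @ R.T, np.eye(5)), round(float(np.linalg.det(R)), 6))
```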
Definition 14.4.2 The mapping f : IR^n → IR^k is called K-bi-Lipschitz for a subset X ⊆ IR^n if there exists a constant c > 0 such that
cK^{−1} · ‖p − q‖ ≤ ‖f(p) − f(q)‖ ≤ c · ‖p − q‖,
for all p, q ∈ X.
The least K for which f is K-bi-Lipschitz is called the distortion of f , and is denoted dist( f ). We will refer to f as a K-
embedding of X.
Theorem 14.4.3 (Johnson-Lindenstrauss lemma.) Let X be an n-point set in a Euclidean space, and let ε ∈ (0, 1] be given. Then
there exists a (1 + ε)-embedding of X into IRk , where k = O(ε−2 log n).
Proof: Let X ⊆ IRn (if X lies in higher dimensions, we can consider it to be lying in the span of its points, if it is in lower
dimensions, we can add zero coordinates). Let k = 200ε−2 ln n. Assume k < n, and let F be a random k-dimensional linear
subspace of IRn . Let PF : IRn → F be the orthogonal projection operator of IRn into F. Let m be the number around which kPF (x)k
is concentrated, for x ∈ S(n−1) , as in Lemma 14.4.1.
Fix two points x, y ∈ IR^n; we prove that
(1 − ε/3) m ‖x − y‖ ≤ ‖P_F(x) − P_F(y)‖ ≤ (1 + ε/3) m ‖x − y‖
holds with probability ≥ 1 − n^{−2}. Since there are fewer than n² pairs of points in X, it follows that with constant probability this holds for all pairs of points of X. In such a case, the mapping P_F is a D-embedding of X into IR^k with D ≤ (1 + ε/3)/(1 − ε/3) ≤ 1 + ε, for ε ≤ 1.
Let u = x − y. We have P_F(u) = P_F(x) − P_F(y), since P_F(·) is a linear operator. Thus, the condition becomes (1 − ε/3) m ‖u‖ ≤ ‖P_F(u)‖ ≤ (1 + ε/3) m ‖u‖. Since this condition is scale independent, we can assume ‖u‖ = 1. Namely, we need to show that
|‖P_F(u)‖ − m| ≤ (ε/3) m.
By Lemma 14.4.1 (exchanging the roles of the random subspace and the random vector), for t = εm/3, we have that the probability that this does not hold is bounded by
4 exp(−t²n/2) = 4 exp(−ε²m²n/18) ≤ 4 exp(−ε²k/72) < n^{−2},
since m ≥ (1/2)√(k/n).
Lemma 14.5.2 The following properties hold for the d dimensional Gaussian distribution N d (0, 1):
(i) The distribution N d (0, 1) is centrally symmetric around the origin.
(ii) If X ∼ N d (0, 1) and u is a unit vector, then X · u ∼ N(0, 1).
(iii) If X, Y ∼ N(0, 1) are two independent variables, then Z = X² + Y² follows the exponential distribution with parameter λ = 1/2.
(iv) Given k independent variables X1 , . . . , Xk distributed according to the exponential distribution with parameter λ, then
Y = X1 + · · · + Xk is distributed according to the Gamma distribution Γλ,k (x).
Proof: (i) Let x = (x_1, ..., x_d) be a point picked from the Gaussian distribution. The density is φ_d(x) = φ(x_1)φ(x_2)⋯φ(x_d), where φ(x_i) is the normal density function φ(x_i) = exp(−x_i²/2)/√(2π). Thus φ_d(x) = (2π)^{−d/2} exp(−(x_1² + ⋯ + x_d²)/2). Consider any two points x, y ∈ IR^d such that r = ‖x‖ = ‖y‖. Clearly, φ_d(x) = φ_d(y). Namely, any two points at the same distance from the origin have the same density (i.e., “probability”). As such, the distribution N^d(0, 1) is centrally symmetric around the origin.
(ii) Consider e_1 = (1, 0, ..., 0) ∈ IR^d. Clearly, x · e_1 = x_1, which is distributed N(0, 1). Now, by the symmetry of N^d(0, 1), this implies that x · u is distributed N(0, 1). Formally, let R be a rotation matrix that maps u to e_1. We know that Rx is distributed N^d(0, 1) (since N^d(0, 1) is centrally symmetric). Thus x · u has the same distribution as Rx · Ru, which has the same distribution as x · e_1, which is N(0, 1).
(iii) Let X, Y ∼ N(0, 1), and consider the integral of the density function
A = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} (1/(2π)) exp(−(x² + y²)/2) dx dy.
We change the integration variables to x(r, α) = √r sin α and y(r, α) = √r cos α. The Jacobian of this change of variables is
I(r, α) = det [ ∂x/∂r  ∂x/∂α ; ∂y/∂r  ∂y/∂α ] = det [ (sin α)/(2√r)   √r cos α ; (cos α)/(2√r)   −√r sin α ] = −(1/2) sin²α − (1/2) cos²α = −1/2.
As such, we have
Pr[Z = z] = ∫_{x²+y²=z} (1/(2π)) exp(−(x² + y²)/2)
= ∫_{α=0}^{2π} (1/(2π)) exp(−(x(z, α)² + y(z, α)²)/2) · |I(z, α)| dα
= (1/(2π)) · (1/2) exp(−z/2) ∫_{α=0}^{2π} dα = (1/2) exp(−z/2).
As such, Z has an exponential distribution with λ = 1/2.
(iv) For k = 1 the claim is trivial. Otherwise, let g_{k−1}(x) = λ ((λx)^{k−2}/(k−2)!) exp(−λx). Observe that
g_k(t) = ∫_0^t g_{k−1}(t − x) g_1(x) dx = ∫_0^t ( λ ((λ(t − x))^{k−2}/(k−2)!) exp(−λ(t − x)) ) ( λ exp(−λx) ) dx
= λ² exp(−λt) ∫_0^t (λ(t − x))^{k−2}/(k−2)! dx
= λ exp(−λt) ∫_0^t λ (λx)^{k−2}/(k−2)! dx = λ exp(−λt) (λt)^{k−1}/(k−1)! = g_k(t).
Proof: By Lemma 14.5.2 (ii), each X_i is distributed as N(0, 1), and X_1, ..., X_k are independent. Define Y_i = X_{2i−1}² + X_{2i}², for i = 1, ..., τ, where τ = k/2. By Lemma 14.5.2 (iii), Y_i follows the exponential distribution with parameter λ = 1/2. Let L = Σ_{i=1}^τ Y_i. By Lemma 14.5.2 (iv), the variable L follows the Gamma distribution (k/2, 1/2), and its expectation is E[L] = Σ_{i=1}^{k/2} E[Y_i] = 2τ = k.
Now, let η = βτ = βk/2. We have
Pr[L ≥ βk] = 1 − Pr[L ≤ βk] = 1 − G_{1/2,τ}(βk) = e^{−η} Σ_{i=0}^{τ} η^i/i! ≤ (τ + 1) e^{−η} η^τ/τ!,
since η = βτ > τ, as β > 1. Now, τ! ≥ (τ/e)^τ, as can be easily verified°, and thus
Pr[L ≥ βk] ≤ (τ + 1) e^{−η} η^τ/(τ^τ/e^τ) = (τ + 1) e^{−η} (eη/τ)^τ = (τ + 1) e^{−βτ} (eβτ/τ)^τ
= (τ + 1) e^{−βτ} exp(τ ln(eβ)) = (τ + 1) exp(−τ(β − (1 + ln β)))
≤ ((k + 3)/2) exp(−(k/2)(β − (1 + ln β))).
since (eτ/(τβ))^τ ≥ 1/(2β)^ν. As the sequence (eτ/(iβ))^i is decreasing for i > τ/β, as can be easily verified±, we can bound the (decreasing) summation above by
Σ_{i=τ}^{ν} (eτ/(iβ))^i ≤ ν (e/β)^τ = ⌈2eτ⌉ exp(τ(1 − ln β)).
We conclude
Pr[L ≤ k/β] ≤ 2⌈2eτ⌉ exp(−τ/β + τ(1 − ln β)) ≤ 6k exp(−(k/2)(β^{−1} − (1 − ln β))).
° Indeed, ln n! = Σ_{i=1}^{n} ln i ≥ ∫_{x=1}^{n} ln x dx = [x ln x − x]_{x=1}^{x=n} = n ln n − n + 1 ≥ n ln n − n = ln((n/e)^n).
± Indeed, consider the function f(x) = x ln(c/x); its derivative is f′(x) = ln(c/x) − 1, and as such f′(x) = 0 for x = c/e. Namely, for c = eτ/β, the function f(x) achieves its maximum at x = τ/β, and from this point on the function is decreasing.
Next, we show how to interpret the inequalities of Lemma 14.5.3 in a somewhat more intuitive way. Let β = 1 + ε, for ε with 0 < ε < 1. From the Taylor expansion ln(1 + x) = Σ_{i=0}^{∞} ((−1)^i/(i + 1)) x^{i+1}, it follows that ln β ≤ ε − ε²/2 + ε³/3. By plugging this into the upper bound for Pr[L ≥ βk], we get
Pr[L ≥ βk] ≤ ((k + 3)/2) exp(−(k/2)(1 + ε − 1 − ε + ε²/2 − ε³/3)) ≤ ((k + 3)/2) exp(−(k/2)(ε²/2 − ε³/3)).
On the other hand, since ln β ≥ ε − ε²/2, we have Pr[L ≤ k/β] ≤ 6k exp(∆), where
∆ = −(k/2)(β^{−1} − (1 − ln β)) ≤ −(k/2)(1/(1 + ε) − 1 + ε − ε²/2)
  ≤ −(k/2)(ε²/(1 + ε) − ε²/2) = −(k/2) · (ε² − ε³)/(2(1 + ε)).
Thus, the probability that a given unit vector gets distorted by more than (1 + ε) in any direction² is roughly exp(−kε²/4), for small ε > 0. Therefore, if we are given a set P of n points in ℓ_2, we can set k to roughly 8 ln(n)/ε² and ensure that, with non-zero probability, we obtain a projection which does not distort the distance³ between any two different points of P by more than (1 + ε) in each direction.
Theorem 14.5.4 Let P be a set of n points in IR^d, 0 < ε, δ < 1/2, and k = 16 ln(n/δ)/ε². Let U_1, ..., U_k be random vectors chosen independently from the d-dimensional Gaussian distribution N^d(0, 1), and let T(x) = (U_1 · x, ..., U_k · x) be the resulting linear transformation. Then, with probability ≥ 1 − δ, for any p, q ∈ P, we have
(1/(1 + ε)) √k ‖p − q‖ ≤ ‖T(p) − T(q)‖ ≤ (1 + ε) √k ‖p − q‖.
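A minimal numerical sketch of the transformation of Theorem 14.5.4 (our own illustration; the point set is arbitrary, the constant 16 in k is taken from the theorem, and the check merely verifies the stated inequalities on one random pair):

```python
import numpy as np

rng = np.random.default_rng(2)

n, d, eps, delta = 200, 500, 0.5, 0.1
k = int(np.ceil(16 * np.log(n / delta) / eps ** 2))

P = rng.standard_normal((n, d))              # an arbitrary point set in IR^d
U = rng.standard_normal((k, d))              # rows are U_1, ..., U_k ~ N^d(0, 1)
T = P @ U.T                                  # T(x) = (U_1·x, ..., U_k·x), applied to all points

i, j = rng.choice(n, size=2, replace=False)
orig = np.linalg.norm(P[i] - P[j])
proj = np.linalg.norm(T[i] - T[j])
ratio = proj / (np.sqrt(k) * orig)
print(k, 1 / (1 + eps) <= ratio <= 1 + eps)  # holds for all pairs with probability >= 1 - delta
```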
Corollary 14.5.5 Let k be the target dimension of the transformation T of Theorem 14.5.4, and let β ≥ 3 be a parameter. We have that
‖T(p) − T(q)‖ ≤ β √k ‖p − q‖,
for any two points p, q ∈ P, and this holds with probability ≥ 1 − exp(−kβ²/(32 ln n)).
In fact, the random embedding preserves much more structure than just distances between points. It preserves the structure
and distances of surfaces as long as they are low dimensional and “well behaved”, see [AHY07] for some results in this direction.
Dimension reduction is crucial in learning, AI, databases, etc. One common technique used in practice is to perform PCA (i.e., principal component analysis) and take the first few main axes. Other techniques include independent component analysis and MDS (multidimensional scaling). MDS tries to embed points from high dimension into low dimension (d = 2 or 3), while preserving some properties. Theoretically, dimension reduction into really low dimensions is hopeless, as the distortion in the worst case is Ω(n^{1/(k−1)}), where k is the target dimension [Mat90].
14.7 Exercises
Exercise 14.7.1 (Boxes can be separated.) [1 Points]
(Easy.) Let A and B be two axis-parallel boxes that are interior disjoint. Prove that there is always an axis-parallel hyperplane
that separates the interior of the two boxes.
Corollary 14.7.3 For A and B compact sets in IR^n, we have, for any λ ∈ [0, 1], that Vol(λA + (1 − λ)B) ≥ Vol(A)^λ Vol(B)^{1−λ}.
Chapter 15
As we saw in previous lectures, to solve (1 + ε)-ANN it is enough to efficiently solve the approximate near neighbor problem. Namely, given a set P of n points in H^d, a radius r > 0 and a parameter ε > 0, we want to decide for a query point q whether d_H(q, P) ≤ r or d_H(q, P) ≥ (1 + ε)r.
Definition 15.1.2 For a set P of points, a data-structure NNbr≈ (P, r, (1 + ε)r) solves the approximate near neighbor problem, if
given a query point q, the data-structure works as follows.
• If d(q, P) ≤ r then NNbr≈ outputs a point p ∈ P such that d(p, q) ≤ (1 + ε)r.
• If d(q, P) ≥ (1 + ε)r, then NNbr≈ outputs that “d(q, P) ≥ r”.
• If r ≤ d(q, P) ≤ (1 + ε)r, either of the above answers is acceptable.
Given such a data-structure NNbr≈ (P, r, (1 + ε)r), one can construct a data-structure that answers ANN using O(log(n/ε))
queries.
Definition 15.1.3 Let U be a (small) positive integer. A family F = {h : S → [0, U]} of functions is (r, R, α, β)-sensitive if, for any u, q ∈ S, we have:
• If u ∈ b(q, r) then Pr[h(u) = h(q)] ≥ α.
• If u ∉ b(q, R) then Pr[h(u) = h(q)] ≤ β,
where h is picked randomly from F, r < R, and α > β.
Intuitively, if we can construct a (r, R, α, β)-sensitive family, then we can distinguish between two points which are close
together, and two points which are far away from each other. Of course, the probabilities α and β might be very close to each other,
and we need a way to do amplification.
Lemma 15.1.4 For the hypercube H^d = {0, 1}^d, let F be the set of functions
F = { h_i(b) = b_i | b = (b_1, ..., b_d) ∈ H^d, for i = 1, ..., d }.
Then, for any r and ε, the family F is (r, (1 + ε)r, 1 − r/d, 1 − r(1 + ε)/d)-sensitive.
Proof: If u, v ∈ {0, 1}d are in distance smaller than r from each other (under the Hamming distance), then they differ in at most
r coordinates. The probability that h ∈ F would project into a coordinate that u and v agree on is ≥ 1 − r/d.
Similarly, if dH (u, v) ≥ (1 + ε)r then the probability that h would map into a coordinate that u and v agree on is ≤ 1 − (1 + ε)r/d.
Intuitively, G is a family that extends F by probing into k coordinates instead of only one coordinate.
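The family G(F) itself is not spelled out in this excerpt, so here is a hedged Python sketch of the natural reading: a function g ∈ G(F) is the concatenation of k coordinate projections h_i drawn randomly from F.

```python
import random

def sample_g(d, k, rng=random):
    """A random g in G(F): probe k coordinates chosen (with repetition) from 0..d-1."""
    coords = [rng.randrange(d) for _ in range(k)]
    def g(point):                      # point is a 0/1 tuple of length d
        return tuple(point[c] for c in coords)
    return g

g = sample_g(d=8, k=3, rng=random.Random(0))
print(g((1, 0, 1, 1, 0, 0, 1, 0)))
```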
Lemma 15.1.5 If there is an (r, (1 + ε)r, α, β)-sensitive family F of functions for the hypercube, then there exists a NNbr≈(P, r, (1 + ε)r) which uses O(dn + n^{1+ρ}) space and O(n^ρ) hash probes for each query, where
ρ = ln(1/α) / ln(1/β).
This data-structure succeeds with constant probability.
Proof: It suffices to ensure that properties (1) and (2) hold with probability larger than half.
Set k = log_{1/β} n = (ln n)/ln(1/β). Then the probability that, for a random hash function g ∈ G(F), we have g(p′) = g(q) for a fixed p′ ∈ P \ b(q, (1 + ε)r), is at most
Pr[g(p′) = g(q)] ≤ β^k ≤ exp(ln(β) · (ln n)/ln(1/β)) = 1/n.
Thus, the expected number of elements of P \ b(q, (1 + ε)r) colliding with q in the jth hash table H_j is bounded by one. In particular, the overall expected number of such collisions in H_1, ..., H_τ is bounded by τ. By the Markov inequality, the probability that the number of collisions exceeds 4τ is less than 1/4; therefore the probability that property (2) holds is ≥ 3/4.
Next, for a point p ∈ b(q, r), consider the probability that g_j(p) = g_j(q), for a fixed j. Clearly, it is bounded from below by
α^k = α^{log_{1/β} n} = n^{−ln(1/α)/ln(1/β)} = n^{−ρ}.
Thus the probability that such a g_j exists is at least 1 − (1 − n^{−ρ})^τ. By setting τ = 2n^ρ, we get that property (1) holds with probability ≥ 1 − 1/e² > 4/5. The claim follows.
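Translating Lemma 15.1.5 into code, the data structure is just τ hash tables, each keyed by one g_j, and a query probes the tables and stops after inspecting O(τ) candidates. The following Python sketch is ours; the parameter choices k = log_{1/β} n and τ = 2n^ρ follow the proof, and the structure is specialized to the Hamming cube.

```python
import math, random
from collections import defaultdict

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def build_nnbr(points, alpha, beta, rng):
    """points: 0/1 tuples of equal length d."""
    n, d = len(points), len(points[0])
    k = max(1, int(round(math.log(n) / math.log(1.0 / beta))))
    rho = math.log(1.0 / alpha) / math.log(1.0 / beta)
    tau = max(1, int(round(2 * n ** rho)))
    tables = []
    for _ in range(tau):
        coords = [rng.randrange(d) for _ in range(k)]
        table = defaultdict(list)
        for p in points:
            table[tuple(p[c] for c in coords)].append(p)
        tables.append((coords, table))
    return tables

def query_nnbr(tables, q, r, eps):
    limit = 4 * len(tables)                         # inspect only O(tau) candidates
    seen = 0
    for coords, table in tables:
        for p in table.get(tuple(q[c] for c in coords), []):
            if hamming(p, q) <= (1 + eps) * r:
                return p                            # an approximate near neighbor
            seen += 1
            if seen >= limit:
                return None
    return None
```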
Claim 15.1.6 For x ∈ [0, 1) and t ≥ 1 such that 1 − tx > 0, we have ln(1 − x)/ln(1 − tx) ≤ 1/t.
Proof: Since ln(1 − tx) < 0, it follows that the claim is equivalent to t ln(1 − x) ≥ ln(1 − tx). This in turn is equivalent to showing that g(x) = (1 − tx) − (1 − x)^t ≤ 0 in the given range.
This is trivially true for x = 0. Furthermore, taking the derivative, we see g′(x) = −t + t(1 − x)^{t−1}, which is non-positive for x ∈ [0, 1) and t ≥ 1. Therefore, g is non-increasing in the region in which we are interested, and so g(x) ≤ 0 for all values in this interval.
Lemma 15.1.7 There exists a NNbr≈(r, (1 + ε)r) which uses O(dn + n^{1+1/(1+ε)}) space and O(n^{1/(1+ε)}) hash probes for each query. The probability of success (i.e., if there is a point u ∈ P such that d_H(u, q) ≤ r, then we return a point v ∈ P such that d_H(v, q) ≤ (1 + ε)r) is a constant.
Proof: By Lemma 15.1.4, we have an (r, (1 + ε)r, α, β)-sensitive family of hash functions, where α = 1 − r/d and β = 1 − r(1 + ε)/d. As such,
ρ = ln(1/α)/ln(1/β) = (ln α)/(ln β) = ln((d − r)/d) / ln((d − (1 + ε)r)/d) = ln(1 − r/d) / ln(1 − (1 + ε)r/d) ≤ 1/(1 + ε),
by Claim 15.1.6.
By building O(log n) structures of Lemma 15.1.7, we can amplify the probability of success and get a correct result with high probability.
Theorem 15.1.8 Given a set P of n points on the hypercube H^d, and parameters ε > 0 and r > 0, one can build a NNbr≈ = NNbr≈(P, r, (1 + ε)r), such that given a query point q:
• If b(q, r) ∩ P ≠ ∅, then NNbr≈ returns a point u ∈ P such that d_H(u, q) ≤ (1 + ε)r.
• If b(q, (1 + ε)r) ∩ P = ∅, then NNbr≈ returns that no point is within distance ≤ r from q.
In any other case, either answer is correct. The query time is O(dn^{1/(1+ε)} log n) and the space used is O(dn + n^{1+1/(1+ε)} log n).
The result returned is correct with high probability.
Proof: Note, that every point can be stored only once. Any other reference to it in the data-structure can be implemented with a
pointer. Thus, the O(dn) requirement on the space. The other term follows by repeating the space requirement of Lemma 15.1.7
O(log n) times.
In the hypercube case, when d = n^{O(1)}, we can just build M = O(ε^{−1} log n) such data-structures, so that a (1 + ε)-ANN query can be answered by a binary search over these data-structures, which correspond to the radii r_1, ..., r_M, where r_i = (1 + ε)^i.
Theorem 15.1.9 Given a set P of n points on the hypercube H^d (where d = n^{O(1)}), and parameters ε > 0 and r > 0, one can build an ANN data-structure using O((d + n^{1/(1+ε)}) ε^{−1} n log² n) space, such that given a query point q, one can return an ANN to q in P (under the Hamming distance) in O(dn^{1/(1+ε)} log(ε^{−1} log n)) time. The result returned is correct with high probability.
Lemma 15.2.1 Let v ∈ IR^d be a vector, let X ∼ N^d(0, 1), and let Z ∼ N(0, 1). Then v · X has the same distribution as ‖v‖ Z.
Proof: If ‖v‖ = 1 then this holds by the symmetry of the normal distribution. Indeed, let e_1 = (1, 0, ..., 0). By the symmetry of the d-dimensional normal distribution, we have that v · X ∼ e_1 · X = X_1 ∼ N(0, 1). Otherwise, (v/‖v‖) · X ∼ N(0, 1), and as such v · X ∼ N(0, ‖v‖²), which is indeed the distribution of ‖v‖ Z.
A d-dimensional distribution that has the property of Lemma 15.2.1 is called a 2-stable distribution.
If p and q are at distance η from each other, and when we project onto v⃗ the distance between the projections is t, then the probability that they get the same hash value is 1 − t/r, since this is the probability that the random sliding will not separate them. As such, the probability of collision is
α(η) = Pr[h(p) = h(q)] = ∫_{t=0}^{r} Pr[|p · v⃗ − q · v⃗| = t] (1 − t/r) dt.
However, since v⃗ is chosen from a 2-stable distribution, we have that p · v⃗ − q · v⃗ = (p − q) · v⃗ ∼ N(0, ‖pq‖²). Since we are considering the absolute value of the variable, we need to multiply the density by two. Thus, we have
α(η, r) = ∫_{t=0}^{r} (2/(√(2π) η)) exp(−t²/(2η²)) (1 − t/r) dt.
Intuitively, we care about the difference α(1 + ε, r) − α(1, r), and we would like to maximize it as much as possible (by choosing the right value of r). Unfortunately, this integral is unfriendly, and we have to resort to numerical computation.
In fact, if we are going to use this hashing scheme to construct a locality sensitive hashing scheme, as in Lemma 15.1.5, then we care about the ratio
ρ(1 + ε) = min_r log(1/α(1, r)) / log(1/α(1 + ε, r)).
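Eq. (15.1) itself is not reproduced in this excerpt. Under the standard reading of the scheme just described (project onto a random Gaussian vector v⃗, add a random sliding offset, and cut the line into buckets of length r), one hash function looks as follows; this is a hedged sketch of that reading, not necessarily the text's exact definition.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_hash(d, r):
    """One 2-stable hash: p -> floor((<p, v> + b) / r), with v ~ N^d(0,1), b uniform in [0, r)."""
    v = rng.standard_normal(d)
    b = rng.uniform(0.0, r)
    return lambda p: int(np.floor((np.dot(p, v) + b) / r))

h = sample_hash(d=16, r=4.0)
p = rng.standard_normal(16)
q = p + 0.05 * rng.standard_normal(16)   # a nearby point collides with high probability
print(h(p) == h(q))
```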
Lemma 15.2.2 ([DNIM04]) One can choose r such that ρ(1 + ε) ≤ 1/(1 + ε).
Lemma 15.2.2 implies that the hash functions defined by Eq. (15.1) are (1, 1 + ε, α′, β′)-sensitive and, furthermore, ρ = log(1/α′)/log(1/β′) ≤ 1/(1 + ε), for some values of α′ and β′. As such, we can use this hashing family to construct a NNbr≈ for a set P of points in IR^d. Following the same argumentation as in Theorem 15.1.8, we have the following.
Theorem 15.2.3 Given a set P of n points in IR^d, and parameters ε > 0 and r > 0, one can build a NNbr≈ = NNbr≈(P, r, (1 + ε)r), such that given a query point q:
• If b(q, r) ∩ P ≠ ∅, then NNbr≈ returns a point u ∈ P such that ‖u − q‖ ≤ (1 + ε)r.
• If b(q, (1 + ε)r) ∩ P = ∅, then NNbr≈ returns that no point is within distance ≤ r from q.
In any other case, either answer is correct. The query time is O(dn^{1/(1+ε)} log n) and the space used is O(dn + n^{1+1/(1+ε)} log n).
The result returned is correct with high probability.
Theorem 15.2.4 Given a set P of n points in IRd , then one can construct data-structures D that answers (1 + ε)-ANN queries, by
performing O(log(n/ε)) NNbr≈ queries. The total number of points stored at NNbr≈ data-structures of D is O(nε−1 log(n/ε)).
Constructing the data-structure of Theorem 15.2.4 requires building a low quality HST. Unfortunately, the previous constructions we saw for HSTs are either exponential in the dimension or take quadratic time. We next present a faster scheme.
Proof: Our construction is based on a recursive decomposition of the point set. In each stage, we split the point set into two subsets, recursively compute an nd-HST for each subset, and merge the two trees into a single tree by creating a new vertex, assigning it an appropriate value, and hanging the two subtrees from this node. To carry this out, we try to separate the set into two subsets that are furthest away from each other.
Let R = R(P) be the minimum axis-parallel box containing P, and let ν = l(P) = Σ_{i=1}^{d} ‖I_i(R)‖, where I_i(R) is the projection of R to the ith dimension.
Clearly, one can find an axis-parallel strip H of width ≥ ν/((n − 1)d), such that there is at least one point of P on each of its sides, and there are no points of P inside H. Indeed, to find this strip, project the point set into the ith dimension, and find the longest interval between two consecutive points. Repeat this process for i = 1, ..., d, and use the longest interval encountered. Clearly, the strip H corresponding to this interval is of width ≥ ν/((n − 1)d). On the other hand, diam(P) ≤ ν.
Now recursively continue the construction of two trees T^+, T^−, for P^+, P^−, respectively, where P^+, P^− is the splitting of P into two sets by H. We hang T^+ and T^− from the root node v, and set ∆_v = ν. We claim that the resulting tree T is an nd-HST. To this end, observe that diam(P) ≤ ∆_v, and for a point p ∈ P^− and a point q ∈ P^+, we have ‖pq‖ ≥ ν/((n − 1)d), which implies the claim.
To construct this efficiently, we use efficient search trees to store the points according to their order in each coordinate. Let D_1, ..., D_d be those trees, where D_i stores the points of P in ascending order according to the ith axis, for i = 1, ..., d. We modify them such that, for every node v ∈ D_i, we know the largest empty interval along the ith axis for the points P_v (i.e., the points stored in the subtree of v in D_i). Thus, finding the largest strip to split along can be done in O(d log n) time. Now, we need to split the d trees into two families of d trees. Assume we split according to the first axis. We can split D_1 in O(log n) time using the splitting operation provided by the search tree (treaps, for example, can do this split in O(log n) time). Assume that this splits P into two sets L and R, where |L| < |R|.
We still need to split the other d − 1 search trees. This is done by deleting all the points of L from those trees, and building d − 1 new search trees for L. This takes O(|L| d log n) time. We charge this work to the points of L.
Since in every split only the points in the smaller portion of the split get charged, it follows that every point can be charged at most O(log n) times during this construction algorithm. Thus, the overall construction time is O(dn log² n).
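Here is a compact Python sketch of the recursive split just described. It is our own simplification: the d balanced search trees are replaced by rescanning the coordinates, so it runs in roughly O(dn² log n) rather than O(dn log² n), but it produces the same kind of tree (split along the widest empty axis-parallel strip, with ∆ set to the sum of the bounding-box side lengths). It assumes the input points are distinct.

```python
import numpy as np

def build_hst(points):
    """Recursive fair split along the widest empty axis-parallel strip."""
    pts = np.asarray(points, dtype=float)
    if len(pts) == 1:
        return {"point": pts[0], "delta": 0.0}
    # nu = sum of the side lengths of the bounding box (an upper bound on diam(P)).
    nu = float(np.sum(pts.max(axis=0) - pts.min(axis=0)))
    best = None                                  # (gap, axis, threshold)
    for axis in range(pts.shape[1]):
        vals = np.sort(pts[:, axis])
        gaps = np.diff(vals)
        j = int(np.argmax(gaps))
        if best is None or gaps[j] > best[0]:
            best = (float(gaps[j]), axis, (vals[j] + vals[j + 1]) / 2.0)
    _, axis, thr = best
    left = pts[pts[:, axis] <= thr]
    right = pts[pts[:, axis] > thr]
    return {"delta": nu, "left": build_hst(left), "right": build_hst(right)}

tree = build_hst(np.random.default_rng(4).random((10, 3)))
print(tree["delta"])
```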
Theorem 15.2.6 Given a set P of n points in IR^d, and parameters ε > 0 and r > 0, one can build an ANN data-structure using
O(dn + n^{1+1/(1+ε)} ε^{−2} log³(n/ε))
space, such that given a query point q, one can return a (1 + ε)-ANN to q in P in
O(dn^{1/(1+ε)} log n log(n/ε))
time.
Proof: We compute the low quality HST using Lemma 15.2.5. This takes O(nd log2 n) time. Using this HST, we can construct
the data-structure D of Theorem 15.2.4, where we do not compute the NNbr≈ data-structures. We next traverse the tree D, and
construct the NNbr≈ data-structures using Theorem 15.2.3.
We only need to prove the bound on the space. Observe that we need to store each point only once, since other places can refer to it by a pointer; this gives the O(nd) term. The other term comes from plugging the bound of Theorem 15.2.4 into the bound of Theorem 15.2.3.
From approximate near-neighbor in IR^d to approximate near-neighbor on the hypercube. The reduction is quite involved, and we only sketch the details. Let P be a set of n points in IR^d. We first reduce the dimension to k = O(ε^{−2} log n) using the Johnson-Lindenstrauss lemma. Next, we embed this space into ℓ_1^{k′} (this is the space IR^{k′}, where distances are measured by the L_1 metric instead of the regular L_2 metric), where k′ = O(k/ε²). This can be done with distortion (1 + ε).
Let Q be the resulting set of points in IR^{k′}. We want to solve NNbr≈ on this set of points, for radius r. As a first step, we partition the space into cells by taking a grid with sidelength (say) k′r, translating it randomly, and clipping the points into the grid cells. It is now sufficient to solve the NNbr≈ problem inside each grid cell (which has bounded diameter as a function of r), since only with small probability does the random shift separate a query point from its near neighbor; we amplify the probability of success by repeating this a polylogarithmic number of times.
0
Thus, we can assume that P is contained inside a cube of side length ≤ k0 nr, and it is in IRk , and the distance
metric is the L1 metric. We next, snap the points of P to a grid of sidelength (say) εr/k0 . Thus, every point of P now
has an integer coordinate, which is bounded by a polynomial in log n and 1/ε. Next, we write the coordinates of the
points of P using unary notation. (Thus, a point (2, 5) would be written as (010, 101) assuming the number of bits for
each coordinates is 3.) It is now easy to verify that the hamming distance on the resulting strings, is equivalent to the
L1 distance between the points.
Thus, we can solve the near-neighbor problem for points in IRd by solving it on the hypercube under the Hamming
distance.
See Indyk and Motwani [IM98] for more details.
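A tiny illustration (ours) of the final step: unary encoding turns the L_1 distance between small non-negative integer coordinates into the Hamming distance between bit strings.

```python
def unary(value, width):
    """Encode an integer 0 <= value <= width as 'value' ones followed by zeros."""
    return "1" * value + "0" * (width - value)

def encode(point, width):
    return "".join(unary(c, width) for c in point)

a, b = encode((2, 5), 7), encode((4, 1), 7)
hamming = sum(x != y for x, y in zip(a, b))
l1 = abs(2 - 4) + abs(5 - 1)
print(hamming == l1)   # True: Hamming distance of the encodings equals the L1 distance
```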
This relationship indicates that ANN on the hypercube is “equivalent” to ANN in Euclidean space. In particular, making progress on ANN on the hypercube would probably lead to similar progress on the Euclidean ANN problem.
We have only scratched the surface of proximity problems in high dimensions. The interested reader is referred to the survey by Indyk [Ind04] for more information.
Chapter 16
16.2 Ellipsoids
Definition 16.2.1 Let b = { x | ‖x‖ ≤ 1 } be the unit ball in IR^n. Let a ∈ IR^n be a vector and let T : IR^n → IR^n be an invertible linear transformation. The set
E = T(b) + a
is called an ellipsoid, and a is its center.
Clearly, x ∈ E if and only if ‖T^{−1}(x − a)‖ ≤ 1. However,
‖T^{−1}(x − a)‖² = ⟨T^{−1}(x − a), T^{−1}(x − a)⟩ = (T^{−1}(x − a))^T T^{−1}(x − a) = (x − a)^T (T^{−1})^T T^{−1} (x − a).
In particular, let Q = (T^{−1})^T T^{−1}. Observe that Q is symmetric, and that it is positive definite. Thus,
E = { x ∈ IR^n | (x − a)^T Q (x − a) ≤ 1 }.
If we change the basis of IR^n to be the set of unit eigenvectors of Q, then Q becomes a diagonal matrix, and we have that
E = { (y_1, ..., y_n) ∈ IR^n | λ_1(y_1 − a_1)² + ⋯ + λ_n(y_n − a_n)² ≤ 1 },
where a = (a_1, ..., a_n) and λ_1, ..., λ_n are the eigenvalues of Q. In particular, this implies that the points (a_1, ..., a_i ± 1/√λ_i, ..., a_n) ∈ ∂E, for i = 1, ..., n. In particular,
Vol(E) = Vol(b)/(√λ_1 ⋯ √λ_n) = Vol(b)/√(det(Q)).
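A quick numerical check (ours) of the volume formula: for E = T(b) + a with Q = (T^{−1})^T T^{−1}, we have Vol(E) = Vol(b)/√(det(Q)) = |det(T)| · Vol(b).

```python
import math
import numpy as np

def unit_ball_volume(n):
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

rng = np.random.default_rng(5)
n = 4
T = rng.standard_normal((n, n))              # an invertible linear map (almost surely)
Tinv = np.linalg.inv(T)
Q = Tinv.T @ Tinv

vol_via_Q = unit_ball_volume(n) / math.sqrt(np.linalg.det(Q))
vol_via_T = abs(np.linalg.det(T)) * unit_ball_volume(n)
print(np.isclose(vol_via_Q, vol_via_T))      # the two expressions agree
```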
For a convex body K (i.e., a convex and bounded set), let E be a largest volume ellipsoid contained inside K. One can show
that E is unique. Namely, there is a single maximum volume ellipsoid inside K.
Theorem 16.2.2 Let K ⊂ IR^n be a convex body, and let E ⊆ K be its maximum volume ellipsoid. Suppose that E is centered at the origin; then K ⊆ nE = { nx | x ∈ E }.
Proof: By applying a linear transformation, we can assume that E is the unit ball b. And assume for the sake of contradiction
that there is a point p ∈ K, such that kpk > n. Consider the set C which is the convex-hull of {p} ∪ b. Since K is convex, we have
that C ⊆ K.
We will reach a contradiction, by finding an ellipsoid G which has volume larger than b and is enclosed inside C.
By rotating space, we can assume that the apex p of C is the point (ρ, 0, ..., 0), for ρ > n. We consider ellipsoids of the form
G = { (y_1, ..., y_n) | (y_1 − τ)²/α² + (1/β²) Σ_{i=2}^{n} y_i² ≤ 1 }.
We need to pick the values of τ, α and β such that G ⊆ C. Observe that, by symmetry, it is enough to enforce that G ⊆ C in the first two dimensions.
Thus, we can consider C and G to be in two dimensions. Now, the center of G is going to be on the x-axis, at the point (τ, 0). The set G is just an ellipse with axes parallel to the x- and y-axes. In particular, we require that (−1, 0) be on the boundary of G. This implies that (−1 − τ)² = α² and that 0 ≤ τ ≤ (ρ − 1)/2. Namely,
α = 1 + τ.     (16.1)
In particular, the equation of the curve forming the boundary of G is
F(x, y) = (x − τ)²/(1 + τ)² + y²/β² − 1 = 0.
We next compute the value of β². Consider the tangent ℓ to the unit circle that passes through p = (ρ, 0), and let q be the point where ℓ touches the unit circle. We have △poq ∼ △oqr, and as such
‖r − q‖/‖q − o‖ = ‖q − o‖/‖o − p‖.
Since ‖q − o‖ = 1, we have ‖q − r‖ = 1/ρ. Furthermore, since q is on the unit circle, we have
q = (1/ρ, √(1 − 1/ρ²)).
As such, the equation of the line ℓ is
⟨q, (x, y)⟩ = 1  ⟹  x/ρ + √(1 − 1/ρ²) y = 1  ⟹  ℓ ≡ y = −x/√(ρ² − 1) + ρ/√(ρ² − 1).
Next, consider the tangent ℓ′ to G at (u, v). We will derive a formula for ℓ′ as a function of (u, v), and then require that ℓ = ℓ′. The slope of ℓ′ is the slope of the tangent to G at (u, v), which is
dy/dx = −F_u(u, v)/F_v(u, v) = −(2(u − τ)/(1 + τ)²) / (2v/β²) = −β²(u − τ)/(v(1 + τ)²),
by computing the derivatives of the implicit function F. The line ℓ′ is just
((u − τ)/α²)(x − τ) + (v/β²) y = 1,
since it has the required slope and it passes through (u, v). Namely, ℓ′ ≡ y = −(β²(u − τ)/(v α²))(x − τ) + β²/v. Setting ℓ′ = ℓ, we have that the line ℓ′ passes through (ρ, 0). As such,
(ρ − τ)(u − τ)/α² = 1  ⟹  (u − τ)/α = α/(ρ − τ)  ⟹  (u − τ)²/α² = α²/(ρ − τ)².     (16.2)
Since ℓ and ℓ′ are the same line, we have that
β²(u − τ)/(v α²) = 1/√(ρ² − 1).
However,
β²(u − τ)/(v α²) = (β²/(vα)) · ((u − τ)/α) = (β²/(vα)) · (α/(ρ − τ)) = β²/(v(ρ − τ)).
Thus,
β²/v = (ρ − τ)/√(ρ² − 1).
Squaring and inverting both sides, we have v²/β⁴ = (ρ² − 1)/(ρ − τ)², and thus v²/β² = β²(ρ² − 1)/(ρ − τ)². The point (u, v) ∈ ∂G, and as such (u − τ)²/α² + v²/β² = 1. Using Eq. (16.2) and the above, we get
α²/(ρ − τ)² + β²(ρ² − 1)/(ρ − τ)² = 1,
and thus
β² = ((ρ − τ)² − α²)/(ρ² − 1) = ((ρ − τ)² − (τ + 1)²)/(ρ² − 1) = (ρ − 2τ − 1)(ρ + 1)/(ρ² − 1) = (ρ − 2τ − 1)/(ρ − 1) = 1 − 2τ/(ρ − 1).
Namely, for n ≥ 2, and 0 ≤ τ < (ρ − 1)/2, we have that the ellipsoid G defined by the parameters α = 1 + τ and β is contained inside the “cone” C. It holds that Vol(G) = β^{n−1} α Vol(b), and thus
µ = ln(Vol(G)/Vol(b)) = (n − 1) ln β + ln α = ((n − 1)/2) ln β² + ln α.
For τ > 0 sufficiently small, we have ln α = ln(1 + τ) = τ + O(τ²), because of the Taylor expansion ln(1 + x) = x − x²/2 + x³/3 − ⋯, for −1 < x ≤ 1. Similarly, ln β² = ln(1 − 2τ/(ρ − 1)) = −2τ/(ρ − 1) + O(τ²). Thus,
µ = ((n − 1)/2)(−2τ/(ρ − 1) + O(τ²)) + τ + O(τ²) = (1 − (n − 1)/(ρ − 1)) τ + O(τ²) > 0,
for τ sufficiently small and ρ > n. Thus, Vol(G)/Vol(b) = exp(µ) > 1, implying that Vol(G) > Vol(b). A contradiction.
A convex body K centered at the origin is symmetric if p ∈ K implies −p ∈ K. Interestingly, the constant in Theorem 16.2.2 can be improved to √n in this case. We omit the proof, since it is similar to the proof of Theorem 16.2.2.
Theorem 16.2.3 Let K ⊂ IR^n be a symmetric convex body, and let E ⊆ K be its maximum volume ellipsoid. Suppose that E is centered at the origin; then K ⊆ √n E.
Chapter 17
In addition, the sirloin which I threw overboard, instead of drifting off into the void, didn’t seem to want to leave
the rocket and revolved about it, a second artificial satellite, which produced a brief eclipse of the sun every eleven
minutes and four seconds. To calm my nerves I calculated till evening the components of its trajectory, as well as
the orbital perturbation caused by the presence of the lost wrench. I figured out that for the next six million years the
sirloin, rotating about the ship in circular path, would lead the wrench, then catch up with it from behind and pass it
again.
– The Star Diaries, Stanislaw Lem.
In this chapter, we will introduce a powerful technique for “structure” approximation. The basic idea is to perform a search by assigning the elements weights and picking elements according to their weights. An element’s weight indicates its importance. By repeatedly picking elements according to their weights, and updating the weights of objects that are being neglected (i.e., they are more important than their current weights indicate), we end up with a structure that has the desired properties.
We will demonstrate this technique for two problems. In the first problem, we will compute a spanning tree of points that
has low stabbing number. In the second problem, we will show how the Set Cover problem can be approximated efficiently in
geometric settings, yielding a better bound than the general approximation algorithm for this problem.
segments of E_{i−1} that intersect ℓ. Thus, the weight of a segment s, in the beginning of the ith iteration, is
ω_{s,i} = Σ_{ℓ∈L, ℓ∩s≠∅} ω_{ℓ,i−1}.
Clearly, the heavier a segment s is, the less desirable it is for the spanning tree. As such, we always pick an edge qr such that q, r ∈ P belong to two different connected components of the forest induced by E_{i−1}, and its weight is minimal among all such edges. We repeat this process till we end up with a spanning tree of P. To simplify the implementation of the algorithm, when adding s to the set of edges in the forest, we also remove one of its endpoints from P (i.e., every connected component of our forest has a single representative point). Thus, the algorithm terminates when P contains a single point.
We claim that the resulting spanning tree has the required properties. Clearly, the algorithm can be implemented in polynomial time, since it performs n − 1 iterations, and as such the largest weight used is ≤ 2^n. Such numbers can be manipulated in polynomial time, and as such the running time is polynomial.
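A brute-force Python sketch of the reweighting algorithm just described (ours): the set of lines L is taken as input, the minimum-crossing-weight edge between different components is found by scanning all pairs, and the weights of the crossed lines are doubled after each pick.

```python
import itertools
import numpy as np

def crosses(line, p, q):
    """Does the line a*x + b*y + c = 0 separate points p and q?"""
    a, b, c = line
    return (a * p[0] + b * p[1] + c) * (a * q[0] + b * q[1] + c) < 0

def low_stabbing_tree(points, lines):
    points = [tuple(p) for p in points]
    weight = {tuple(l): 1.0 for l in lines}
    parent = {p: p for p in points}               # union-find over the points

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    edges = []
    while len(edges) < len(points) - 1:
        best = None
        for p, q in itertools.combinations(points, 2):
            if find(p) == find(q):
                continue
            w = sum(weight[l] for l in weight if crosses(l, p, q))
            if best is None or w < best[0]:
                best = (w, p, q)
        _, p, q = best
        edges.append((p, q))
        for l in weight:                           # double the weight of every crossed line
            if crosses(l, p, q):
                weight[l] *= 2.0
        parent[find(p)] = find(q)
    return edges

rng = np.random.default_rng(6)
pts = rng.random((12, 2))
lns = [(np.cos(t), np.sin(t), -rng.random()) for t in rng.random(20) * np.pi]
print(len(low_stabbing_tree(pts, lns)))
```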
Lemma 17.1.1 Let P be a set of n points in the plane, and let L be a set of lines in the plane with total weight W. One can always find a pair of points q and r in P, such that the total weight of the segment s = qr (i.e., the total weight of the lines of L intersecting s) is at most cW/√n, for some constant c.
Proof: First, since the weights considered by us are always integers, we can assume all the weights are 1, by replacing a line ℓ of weight ω_ℓ by ω_ℓ copies of it. Perturb the lines slightly, so that no pair of them is parallel. Next, consider a point q ∈ IR², and all the vertices of the arrangement A = A(L). Consider the ball b(q, r) of all vertices of the arrangement A at crossing distance at most r from q.
We claim that |b(q, r)| ≥ r²/8. Indeed, one can shoot a ray ζ from q that intersects at least W/2 lines of L. Let ℓ_1, ..., ℓ_{r/2} be the first r/2 lines hit by the ray ζ, and let r_1, ..., r_{r/2} be the respective intersection points between these lines and ζ. Now, mark all the intersection points of the arrangement A(L) along the line ℓ_i that are at distance at most r/2 from r_i, for i = 1, ..., r/2. Clearly, we marked at least (r/2)(r/2)/2 vertices of the arrangement overall, since we marked (at least) r/2 vertices along each of the lines ℓ_1, ..., ℓ_{r/2}, and each vertex can be counted in this way at most twice. Now, observe that all these vertices are at distance at most (r/2) + (r/2) from q, by the triangle inequality.
So, consider the set of vertices X(r) = ∪_{p∈P} b(p, r). Clearly, as long as the balls of X(r) are disjoint, the number of vertices of the arrangement A included in X(r) is at least nr²/8. In particular, the overall number of vertices in the arrangement is W(W − 1)/2, and as such two balls of X(r) must intersect once nr²/8 > W(W − 1)/2; namely, when r² > 4W²/n, that is, when r > 2W/√n. Now, when r = ⌈2W/√n⌉ + 1, there must be a vertex v of the arrangement A and two points q, r ∈ P, such that d_Q(q, v), d_Q(v, r) ≤ r, and by the triangle inequality, we have that
d_Q(q, r) ≤ d_Q(q, v) + d_Q(v, r) ≤ O(W/√n).
W_i ≤ W_{i−1} + cW_{i−1}/√(n_i) ≤ (1 + c/√(n_i)) W_{i−1} ≤ W_0 ∏_{k=1}^{i} (1 + c/√(n_k))
   ≤ W_0 exp( Σ_{k=1}^{i} c/√(n_k) ) = W_0 exp( Σ_{k=1}^{i} c/√(n − k + 1) ),
since 1 + x ≤ e^x, for all x ≥ 0. In particular, we have that
W_n ≤ W_0 exp( Σ_{k=1}^{n} c/√(n − k + 1) ) = W_0 exp( Σ_{k=1}^{n} c/√k ) ≤ n² exp(4c√n),
since Σ_{k=1}^{n} 1/√k ≤ 1 + ∫_{x=1}^{n+1} (1/√x) dx = 1 + [2√x]_{x=1}^{x=n+1} ≤ 4√n. In the other direction, consider the heaviest line ℓ in L at the end of the execution of the algorithm. If it crosses ∆ edges of the spanning tree, then its weight is 2^∆, and as such
2^∆ = ω_ℓ ≤ W_n ≤ n² exp(4c√n).
It follows that ∆ = O(log n + √n), as required. Namely, any line in the plane crosses at most O(√n) edges of T.
Theorem 17.1.2 Given a set P of n points in the plane, one can compute a spanning tree T of P such that each line crosses at most O(√n) edges of T. The running time is polynomial in n.
This result also holds in higher dimensions. The proof is left as an exercise (see Exercise 17.4.1).
Theorem 17.1.3 Given a set P of n points in IRd , one can compute a spanning tree T of P such that each line crosses at most
O(n1−1/d ) edges of T. The running time is polynomial in n.
Lemma 17.1.4 One can compute a perfect matching M of a set of n points in the plane, such that every line crosses at most O(√n) edges of the matching.
(A somewhat similar argument is being used in the 2-approximation algorithm for TSP with the triangle inequality.)
Now, going back to the discrepancy question, we remind the reader that we would like to color the points by {−1, 1} such that for any halfplane the ‘balance’ of the coloring is as close to perfect as possible. To this end, we use the matching of the above lemma, and plug it into Theorem 5.4.1. Since any line ℓ crosses at most #_ℓ = O(√n) edges of M, we get the following result.
Theorem 17.1.5 Let P be a set of n points in the plane. One can compute a coloring χ of P by {−1, 1}, such that for every halfplane h it holds that
|χ(h)| = O(n^{1/4} √(ln n)).
Namely, the discrepancy of n points with respect to halfplanes is O(n^{1/4} √(ln n)). This also implies that one can construct a better ε-sample in this case. But before delving into this, let us prove a more general version of the spanning tree lemma.
Lemma 17.1.6 Let P be a set of n points, and let F be a weighted set of ranges from R|_P, with total weight W. Then, there is a pair of points e = {p, q} ⊆ P, such that the total weight of the ranges crossed by e is at most O(Wn^{−1/d} log n).
Proof: Let ε be a parameter to be specified shortly. For an edge {u, v} ⊆ P, consider the set of ranges that it crosses:
C(u, v) = { r ∈ F | (u ∈ r and v ∉ r) or (u ∉ r and v ∈ r) }.
So, consider the dual range space S⋆ = (R, X⋆).´ This range space has shattering dimension bounded by d, by assumption. Consider the new range space
T = (R, { r ⊕ r′ | r, r′ ∈ X⋆ }),
where ⊕ is the symmetric difference of the two sets; namely, r ⊕ r′ = (r \ r′) ∪ (r′ \ r). By arguing as in Corollary 5.2.7, we have that T has shattering dimension at most 2d. Furthermore, the projected range space T|_F has C(u, v) as a range.
So, consider a random sample R of size O((d/ε) log(d/ε)) from F (note that F is weighted, and the random sampling is done accordingly). By the ε-net theorem (Theorem 5.3.4), we know that, with constant probability, this is an ε-net. Namely, a set C(u, v) which does not contain any range of R has weight at most εW.
On the other hand, the range space S⋆|_R = (R, X⋆|_R) has at most
µ = O(|R|^d) = (c (d/ε) log(d/ε))^d
ranges, since the dual range space has shattering dimension d, where c is an appropriate constant. In particular, let us pick ε as large as possible, such that µ < n = |P|. We are then guaranteed that there are two points p, q ∈ P such that R_p = R_q. In particular, the total weight of C(p, q) is at most εW. Thus, we are left with the task of picking ε.
So, for ε = c_1 n^{−1/d} log n it holds that µ < n, if c_1 is sufficiently large, as can be verified.
To complete the argument, we need to bound the total weight of the ranges at the end of this process. It is bounded by
U = n^{d′} ∏_{i=1}^{n} (1 + c_1 log(i)/i^{1/d}) ≤ n^{d′} exp( c_1 Σ_i log(i)/i^{1/d} ) ≤ n^{d′} exp( O(c_1 n^{1−1/d} log n) ).
Now, the crossing number of the resulting tree T, for any range r ∈ R, is bounded by lg(U) = O(d′ log n + n^{1−1/d} log n). We thus conclude:
Theorem 17.1.7 Given a range space S = (X, R) with shattering dimension d′ and dual shattering dimension d, and a set P ⊆ X of n points, one can compute, in polynomial time, a spanning tree T of P, such that any range of R crosses at most O(d′ log n + n^{1−1/d} log n) edges of T.
The algorithm. Interestingly, one can do much better if the set system S has bounded dual shattering dimension d. Indeed, let us assign weight 1 to each range of R, and pick a random subset R of size O((d/ε) log(d/ε)) from R, where ε = 1/(4k) (the sample is chosen according to the weights). If the sets of R cover all the points of X, then we are done. Otherwise, consider a point p ∈ X which is not covered by R. If the total weight of R_p (i.e., the set of ranges covering p) is smaller than εW(R), where W(R) = Σ_{r∈R} ω_r, then we double the weight of all the sets of R_p. In any case, even if the doubling is not carried out, we repeat this process till it succeeds.
Details and Intuition. In the above algorithm, if a random sample fails (i.e., there is an uncovered point p) then one of the ranges that cover p must be in the optimal solution. In particular, by increasing the weight of the ranges covering p, we improve the probability that p would be covered in the next iteration. Furthermore, with good probability, the sample is an ε-net, and as such the algorithm doubles the weight of only a “few” ranges. One of these few ranges must be in the optimal solution. As such, the weight of the optimal solution grows exponentially at a faster rate than the total weight of the universe, implying that at some point the algorithm must terminate, as otherwise the weight of the optimal solution would exceed the total weight of all the ranges, which is of course impossible.
´
We remind the reader that X? = Rp p ∈ X and Rp = r r ∈ R, the range r contains p .
If the ranges of R are geometric shapes, it means that the arrangement formed by the shapes of R has at most µ faces.
156
17.2.1 Proof of correctness
In the following, we bound the number of iterations performed by the algorithm. As before, let W0 = m be the initial weight of the
ranges, and Wi would be the weight in the end of the ith iteration. We consider an iteration to be successful if the doubling stage is
being performed in the iteration. Since an iteration is successful if the sample is an ε-net, and by Theorem 5.3.4 the probability for
that is at least, say, 1/2, it follows that we need to bound only the number of successful iterations (indeed, it would be easy to verify
that with high probability, using the Chernoff inequality, the number of successful iteration is at least a quarter of all iterations
performed).
As before, we know that Wi ≤ (1 + ε)Wi−1 = (1 + ε)i m ≤ m exp(εi). On the other hand, in each iteration the algorithm “hits” at
least one of the ranges in the optimal solution. Let ti ( j) be the number of times the weight of the jth range in the optimal solution
was doubled, for j = 1, . . . , k, where k is the size of the optimal solution. Clearly, the weight of the universe in the ith iteration is at
least
Xk
2ti ( j) .
j=1
But this quantity is minimized when ti (1), . . . , eti (k) are as equal to each other as possible. (Indeed, 2a + 2b ≥ 2 · 2b(a+b)/2c , for any
integers a, b ≥ 0.) As such, we have that
Xk
k2bi/kc ≤ 2ti ( j) ≤ Wi ≤ m exp(εi) .
j=1
t
So, consider i = tk, for t an integer. We have that k2 ≤ m exp(εtk) = m exp(t/4), since ε = 1/4k. Namely,
t t
lg k + t ≤ lg m + lg e ≤ lg m + ,
4 2
as lg e ≤ 1.45. Namely, t ≤ 2 lg(m/k). We conclude that the algorithm performs at most 2 lg(m/k) successful iterations, and as such
it performs at most O(log(m/k)) iterations overall.
Running time. It is easy to verify that with careful implementation the sampling stage can be carried out in linear time. The
size of the resulting approximation is O((d/ε) log(d/ε)) = O(dk log dk). Checking if all the points are covered by the random
sample takes O(ndk log dk) time, assuming we can in constant time determine if a point is inside a range. Computing the total
weight of the ranges covering p takes O(m) time. Thus, each iteration takes O(m + ndk log dk) time.
Note, that we assumed that k is known to us in advance. This can be easily overcome by doing an exponential search for the
right value of k. Given a guess κ to the value of k, we will run the algorithm with κ. If the algorithm exceeds c log(m/k) iterations
without terminating, we know the guess is too small and we continue to try 2κ. We thus get the following.
Theorem 17.2.1 Given a finite range space S = (X, R) with n points and m ranges, and S has a dual shattering dimension d, then
one can compute a cover of X, using the ranges of R, such that the cover uses O(dk log(dk)) sets, where k is the size of the optimal
set cover.
The running time of the algorithm is O((m + ndk log(dk)) log(m/k) log n) time, with high probability, assuming that in constant
time we can decide if a point is inside a range.
Theorem 17.2.2 The VC dimension of the range space formed by all visibility polygons inside a polygon P (i.e., S above) is a
constant.
®
As such, under our definition of visibility, one can see through a reflex corner of a polygon.
157
So, consider a simple polygon P with n vertices, and let b P be a (finite) set of visibility polygons that covers P (say, all the
visibility polygons induced by vertices of P ). The range space induced by P and b P has finite VC dimension (since this range space
is contained inside the range space of Theorem 17.2.2). To make things more concrete, consider placing a point inside each face of
the arrangement of the visibility polygons of bP inside P , and let U denote
the resulting
set of points. Next, consider the projection
b
of this range space into U; namely, consider the range space S = U, Q ∩ U Q ∈ P . Clearly, a set cover of minimal size for S
corresponds to a minimal number of visibility polygons of bP that covers P . Now, S has size polynomial in P and b P and it can be
clearly be computed efficiently. We can now apply the algorithm of Theorem 17.2.1 to it, and get a “small” set cover. We conclude:
Theorem 17.2.3 Given a simple polygon P of size n, and a set of visibility polygons b P (of polynomial size) that covers P , then one
can compute, in polynomial time, a cover of P using O(kopt log kopt ) polygons of b
P, where kopt is the smallest number of polygons of
b
P that completely covers P .
Proof: Let S be the set of k points under consideration, which are on one side of s and Q is on the other side. Each of these
points sees a subinterval of s (inside P ), and this interval induces a wedge in the plane on the other side of s (we ignore for
the time being the boundary of P ). Also, consider the lines passing through pair of points of S . Together, this set of O(k2 )
lines, rays and segments, partition the place into O(k4 ) different faces, so that for a point inside a face of this arrangement, it
sees the same subset of points of S through the segment s in the same radial order. So, consider a point p inside such a face f ,
and let q1 , . . . , qm be the (clockwise ordered) list of points of S that p sees. To visualize this list, connect p to each one of this
points by a segment. Now, as we now introduce the boundaries of P , some of these segments are no longer realizable since
the intersect the boundary of polygon. Observe, however, the set of visible points from p must be a consecutive subsequence
of the above list; namely, p can see inside P the points qL , . . . , qR , for some L ≤ R. We conclude that since there are O(k2 )
choices for the indices L and R, it follows that inside f there are most O(k2 ) different subsets of S that are realizable inside
Q . Now, there are O(k4 ) such faces in the arrangement, which implies the bound.
Now, consider a triangulation of P . There must be a triangle in this triangulation, such that if we remove it, then every one of
the remaining pieces Q1 , Q2 , Q3 contains at most 2k/3 points of U. Let 4 be this triangle.
Clearly, the complexity of the visibility polygons of b P inside 4 is O(k2 ). Furthermore, inside Qi one can see only T out (k, Qi )
different subsets of the points of U outside Qi , for i = 1, 2, 3. Thus, the total number of different subsets of U one can see is bounded
by
X
3
T (k, P ) ≤ O(k2 ) + T out (k, Qi ) · T (|U ∩ Qi | , Qi ) = O(k2 ) + O(k6 )T (2k/3) = kO(log k) ,
i=1
by Lemma 17.2.4. However, for U to be shattered, we need that T (k, P ) ≥ 2k . Namely, we have that
2k ≤ kO(log k) .
158
A natural question is whether one can compute a spanning tree which is weight sensitive. Namely, √ compute a spanning tree of
n points in the plane, such that if a line has only k points on one side of it, it would cross at most O( k) edges of the tree. A slightly
weaker bound is currently known, see Exercise 17.4.3, which is from [HS06]. This leads to small ε-samples which work for ranges
which are heavy enough.
Another natural question is whether given a set of lines and points in the plane, can one compute a spanning tree with overall
small number of crossings. Surprisingly, one can do (1 + ε)-approximation for the best tree in near linear time, see [HI00]. This
result also extends to higher dimensions.
The algorithm we described for Set Cover falls into a general method of multiplicative weights update. Algorithms of this
family include Littleston’s Winnow algorithm [Lit88], Adaboost [FS97], and many more. In fact, the basic idea can be tracked back
to the fifties. See [AHK06] for a nice survey of this method.
17.4 Exercises
Exercise 17.4.1 Prove Theorem 17.1.3.
Exercise 17.4.2 Show that in the worst case,√one can pick a point set P of n points in the plane, such that any spanning tree T of
P, there exists a line `, such that ` crosses Ω( n) edges of T.
Exercise 17.4.3 [20 Points] Let P be a set of n points in the plane. For a line `, let w+` (resp., w−` ) be the number of points of P
lying above (resp., below or on) `, and define the weight of `, denoted by ω`, to be min(w+` , w−` ).
√
Show, that one can construct a spanning tree T for P such that any line ` crosses at most O( ω` log(n/ω`)) edges of T.
159
160
Chapter 18
Isn’t it an artificial, sterilized, didactically pruned world, a mere sham world in which you cravenly vegetate, a world
without vices, without passions without hunger, without sap and salt, a world without family, without mothers,
without children, almost without women? The instinctual life is tamed by meditation. For generations you have left
to others dangerous, daring, and responsible things like economics, law, and politics. Cowardly and well-protected,
fed by others, and having few burdensome duties, you lead your drones’ lives, and so that they won’t be too boring
you busy yourselves with all these erudite specialties, count syllables and letters, make music, and play the Glass
Bead Game, while outside in the filth of the world poor harried people live real lives and do real work.
– The Glass Bead Game, Hermann Hesse
Proof: The upper bound on the width is obvious. As for the lower bound, observe that P encloses a ball of radius 1/2, and as such
its projection in any direction is at least 1.
The diameter is just the distance between the two points (0, . . . , 0) and (1, . . . , 1).
Lemma 18.1.2 Let P be a convex body, and let h be a hyperplane cutting P. Let µ = Vol(h ∩ P), and let ~v be unit vector orthogonal
to h. Let ρ be the length of the projection of P into the direction of ~v. Then Vol(P) ≥ µ · v/d.
Proof: Let P+ = P ∩ h+ and a be the point of maximum distance of P+ from h, where h+ denotes the positive half space induced
by h.
Let C = h ∩ P. By the convexity of P, the pyramid R = CH(C ∪ {a}) is contained inside P, where CH(S ) denotes the
convex-hull of S . We explicitly compute the volume of R for the sake of completeness. To this end, let s be the segment connecting
a to its projection on h, and let α denote the length of s. Parameterizing s by the interval [0, ρ− ], let C(t) denote the intersection
of the hyperplane passing through s(t) and R, where s(0) = a, and s(t) ∈ h. Clearly, Vol(C(t)) = (t/α)d−1 Vol(C) = (t/α)d−1 µ. (We
abuse notations a bit and refer to the (d − 1)-dimensional volume and d-dimensional volume by the same notation.) Thus,
Z α Z α Z α
µ µ αd αµ
Vol(R) = Vol(C(t)) dt = (t/α)d−1 dt = d−1 td−1 dt = d−1 · = .
t=0 t=0 α t=0 α d d
Thus, Vol(P+ ) ≥ αµ/d. Similar argumentation can be applied to P− = P ∩ h− . Thus, we have Vol(P) ≥ ρµ/d.
Proof: Since the width of H = [0, 1]d is 1 and thus the projection of H on the direction perpendicular to h is of length ≥ 1. As
such, by Lemma 18.1.2 if Vol(h ∩ H) > d then Vol(H) > 1. A contradiction.
161
Lemma 18.1.4 Let P ⊆ [0, 1]d be a convex body. Let µ = Vol(P). Then ω(P) ≥ µ/d and P contains a ball of radius µ/(2d2 ).
Proof: By Lemma 18.1.3, any hyperplane cut P in a set of volume at most d. Thus, µ = Vol(P) ≤ ω(P)d. Namely, ω(P) ≥ µ/d.
Next, let E be the largest volume ellipsoid that is contained inside P. By John’s theorem, we have that P ⊆ dE. Let α be the
length of the shortest axis of E. Clearly, ω(P) ≤ 2dα, since ω(dE) = 2dα. Thus 2dα ≥ µ/d. This implies that α ≥ µ/(2d2 ).
Thus, E is an ellipsoid with its shortest axis is of length α ≥ µ/(2d2 ). In particular, E contains a ball of radius α, which is in
turn contained inside P.
In Lemma 18.1.4 we used the fact that r(P) ≥ ω(P)/(2d) (which we proved using John’s theorem), where r(P) is the radius
of the largest
√ ball enclosed inside P. Not surprisingly, considerably
√ better bounds are known. √ In particular, it is known that
ω(P)/(2 d) ≤ r(p) for add dimension, and ω(P)/(2(d + 1)/ d + 2) ≤ r(P). Thus, r(p) ≥ ω(P)/(2 d + 1) [GK92]. Plugging this
fact into the proof of Lemma 18.1.4, will give us slightly better result.
Proof: Let B be the minimum axis-parallel box containing S , and let s and t be the points in S that √ define√ the longest edge of
B, whose length is denoted by l. By the diameter definition, kstk ≤ diam(S ), and clearly, diam(S ) ≤ d l ≤ d kstk. The points s
and t are easily found in O(nd) time.
Alternatively, pick a point s0 ∈ S , and compute its furthest point t0 ∈ S . Next, let a, b be the two points realizing the diameter.
We have diam(S ) = kabk ≤ kas0 k + ks0 bk ≤ 2 ks0 t;k. Thus, ks0 t0 k is 2-approximation to the diameter of P.
Proof: By using the algorithm of Lemma 18.2.1 we compute in O(n) time two points s, t ∈ P which form a 2-approximation of
the diameter of P. For the simplicity of exposition, we assume that st is on the xd -axis (i.e., the line ` ≡ ∪ x (0, . . . , 0, x)), and there
is one point of S that lies on the hyperplane h ≡ xd = 0, an that xd ≥ 0 for all points of P.
Let Q be the orthogonal projection of P into h, and let I be the shortest interval on ` which contain the projection of P into ` s .
By recursion, we can compute a bounding box B0 of Q in h. Let the bounding box be B = B0 × I. Note, that in the bottom of
the recursion, the point-set is one dimensional, and the minimum interval containing the points can be computed in linear time.
Clearly, P ⊆ B, and thus we only need to bound the quality of approximation. We next show that Vol(B) ≥ Vol(P)/cd , where
C = CH(P), and cd = 2d · d!. We prove this by induction on the dimension. For d = 1 the claim trivially holds. Otherwise, by
induction, that Vol(B0 ) ≥ Vol(C 0 )/cd−1 , where C 0 = CH(Q).
For a point p ∈ C 0 , let ` p be the line parallel to xd -axis passing through p. Let L(p) be the minimum value of xd for the
points of ` p lying inside C, and similarly, let U(p) be the maximum value of xd for the points of ` p lying inside C. That is
` p ∩ C = [L[p), U(p)]. Clearly, since C is convex, the function L(·) is concave, and U(·) is convex. As such, γ(p) = U(p) − L(p) is
a convex function, being the difference between a convex and a concave function. In particular, γ(·) induces the following convex
body
[
U= (x, 0), (x, γ(x)) .
x∈C 0
Clearly, Vol(U) = Vol(C). Furthermore, γ((0, . . . , 0)) ≥ kstk and U is shaped like a “pyramid” its base is on the hyperplane xd = 0
is the set C 0 , and the segment [(0, . . . , 0), (0, . . . , 0, kstk)] is contained inside it. Thus,
by Lemma 18.1.2. Let r = |I| be the length of the projection of S into the line `, we have that r ≤ 2 kstk. Thus,
162
Let T be an affine transformation that maps B to the unit hypercube H = [0, 1]d . Observe that Vol(T (C)) ≥ 1/cd . By
2
Lemma
√ 18.1.4, 2there
√ is ball b of radius r d≥ 1/(c d · 2d ) contained inside T (C). The ball b contains a hypercube of sidelength
2r/ d ≥ 2/(2d dcd ). Thus, for c = 1/ 2 d!d , there exists a vector v0 ∈ IRd , such that cH + v0 ⊆ T (C). Thus, applying T −1 to
5/2
both sides, we have that there exists a vector v ∈ IRd , such that c · B + v = c · T −1 (H) + v ⊆ C.
Lemma 18.4.1 If Bopt is the minimum volume bounding box of P, then it has two adjacent faces which are flush.
This provides us with a natural algorithm to compute the minimum volume bounding box. Indeed, let us check all possible
pair of edges e, e0 ∈ CH(P). For each such pair, compute the minimum volume bounding box that has e and e0 as flush.
Consider the normal ~n of the face of a bounding box that contains e. The normal ~n lie on a great circle on the sphere of
directions, which are all the directions that are orthogonal to e. Let us parameterize ~n by a point on this normal. Next, consider
the normal n~0 to the face that is flush to e0 . Clearly, n~0 is orthogonal both to e0 and ~n. As such, we can compute this normal in
constant time. Similarly, we can compute the third direction of the bounding box using vector product in const time. Thus, if e
and e0 are fixed, there is one dimensional family of bounding boxes of P that have e and e0 flush on them, and comply with all the
requirements to be a minimum volume bounding box.
It is now easy to verify that we can compute the representation of this family of bounding boxes, by tracking what vertices of
the convex-hull the bounding boxes touches (i.e., this is similar to the rotating calipers algorithm, but one has to be more careful
about the details). This can be done in linear time, and as such, one con compute the minimum volume bounding box in this family
in linear time. Doing this for all pair of edges, results in O(n3 ) time algorithm, where n = |P|.
Theorem 18.4.2 Let P be a set of n points in IR3 . One can compute the minimum volume bounding box of P in O(n3 ) time.
163
be the set of eight vertices of the cell of G that contains p, and let S G = ∪ p∈S G(p). Define P = CH(S G ). Clearly, CH(P) ⊆ P ⊆ Q.
Moreover, one can compute P in O(n + (1/ε2 ) log (1/ε)) time. On the other hand, P ⊆ B ⊕ Bε . The latter term is a box which
contains at most k = 2c/ε + 1 grid points along each of the directions set by B, so k is also an upper bound for the number of grid
points contained by P in each direction. As such, the convex hull of CH(P) is O(k2 ), as every grid line can contribute at most two
vertices to the convex hull. Let R the set of vertices of P. We next apply the exact algorithm of Theorem 18.4.2 to R. Let b B denote
the resulting bounding box.
It remains to show that b B is a (1 + ε)-approximation of Bopt (P). Let Bεopt be a translation of 4ε Bopt (P) that contains Bε . (The
existence of Bεopt is guaranteed by Lemma 18.3.1, if we take c = 160.) Thus, R ⊆ CH(P) ⊕ Bε ⊆ CH(P) ⊕ Bεopt ⊆ Bopt (P) ⊕ Bεopt .
Since Bopt (P) ⊕ Bεopt is a box, it is a bounding box of P and therefore also of CH(P). Its volume is
ε 3
Vol(Bopt (P) ⊕ Bεopt ) = 1 + Vol(Bopt (P)) < (1 + ε) Vol(Bopt (P)),
4
as desired. (The last inequality is the only place where we use the assumption ε ≤ 1.)
To recap, the algorithm consists of the four following steps:
1. Compute the box B(P) (see Lemma 18.3.1) in O(n) time.
2. Compute the point set S G in O(n) time.
3. Compute P = CH(S G ) in O(n + (1/ε2 ) log (1/ε)) time. This is done by computing the convex hull of all the extreme points
of S G along vertical lines of G. We have O(1/ε2 ) such points, thus computing their convex hull takes O((1/ε2 ) log(1/ε))
time. Let R be the set of vertices of P.
4. Compute Bopt (R) by the algorithm of Theorem 18.4.2. This step requires O((1/ε2 )3 ) = O(1/ε6 ) time.
Theorem 18.5.1 Let P be a set of n points in IR3 , and let 0 < ε ≤ 1 be a parameter. One can compute in O(n + 1/ε6 ) time a
bounding box B(P) with Vol(B(P)) ≤ (1 + ε) Vol(Bopt (P)).
Note that the box B(S ) computed by the above algorithm is most likely not minimal along its directions. The minimum
bounding box of P homothet of B(S ) can be computed in additional O(n) time.
Conjecture 18.6.1 The constants in Lemma 18.3.1 can be improved to be polynomial in the dimension.
Coresets. One alternative approach to the algorithm of Theorem 18.5.1 is to construct G using Bε /2 as before, and picking
from each non-empty cell of G, one point of P as a representative point. This results in a set S of O(1/ε2 ) points. Compute the
minimum volume bounding box S using the exact algorithm. Let B denote the resulting bounding box. It is easy to verify that
(1 + ε)B contains P, and that it is a (1 + ε)-approximation to the optimal bounding box of P. The running time of the new algorithm
is identical. The interesting property is that we are running the exact algorithm on on a subset of the input.
This is a powerful technique for approximation algorithms. You first extract a small subset from the input, and run an exact
algorithm on this input, making sure that the result provides the required approximation. The subset S is referred to as coreset of B
as it preserves a geometric property of P (in our case, the minimum volume bounding box). We will see more about this notion in
the following lectures.
164
Chapter 19
“From the days of John the Baptist until now, the kingdom of heaven suffereth violence, and the violent bear it
away.”
– – Matthew 11:12
Claim 19.1.1 Let P be a set of points in IRd , 0 < ε ≤ 1 a parameter and let Q be a δ-coreset of P for directional width, for
δ = ε/(8d). Let Bopt (Q) denote the minimum volume bounding box of Q. Let B0 be the rescaling of Bopt (Q) around its center by a
factor of (1 + 3δ).
Then, P ⊂ B0 , and in particular, Vol(B0 ) ≤ (1 + ε)Bopt (P), where Bopt (P) denotes the minimum volume bounding box of P.
Proof: Let v be a direction parallel to one of the edges of Bopt (Q), and let ` be a line through the origin with the direction of
v. Let I and I 0 be the projection of Bopt (Q) and B0 , respectively, into `. Let IP be the interval formed by the projection of CH(P)
into `. We have that I ⊆ IP and |I| ≥ (1 − δ) |IP |. The interval I 0 is the result of expanding I around
its center point c by a factor of
1 + 3δ. In particular, the distance between c and the furthest endpoint of I p is ≤ (1 − (1 − δ)/2) I p = (1 + δ) I p /2. Thus, we need
to verify that after the expansion of I it contains this endpoint. Namely,
1−δ 1 + 2δ − 3δ2 1+δ
(1 + 3δ) |IP | ≥ |IP | ≥ |IP | ,
2 2 2
for δ ≤ 1/3. Thus, IP ⊆ I 0 and P ⊆ B0 .
Observe that Vol(B0 ) ≤ (1 + 3δ)d Vol(Bopt (Q)) ≤ exp(3δd) Vol(Bopt (Q)) ≤ (1 + ε)Bopt (Q) ≤ (1 + ε)Bopt (P).
It is easy to verify, that a coreset for directional width, also preserves (approximately) the diameter and width of the point set.
Namely, it captures “well” the geometry of P. Claim 19.1.1 hints on the connection between coreset for directional width and the
minimum volume bounding box. In particular, if we have a good bounding box, we can compute a small coreset for directional
width.
Lemma 19.1.2 Let P be a set of n points in IRd , and let B be a bounding box of P, such that v + cd B ⊆ CH(P), where v is a vector
in IRd , cd = (4d + 1)d and cd B denote the rescaling of B by a factor of cd around its center point.
Then, one can compute
a ε-coreset S for directional width of P. The size of the coreset is O(1/εd−1 ), and construction time is
d−1
O n + min(n, 1/ε ) .
165
Proof: We partition B into a grid, by breaking each edge of B into M = d4/(εcd )e equal length intervals (namely, we tile B with
M d copies of B/M). A cell in this grid is uniquely defined by a d-tuple (i1 , . . . , id ). In particular, for a point p ∈ P, lets I(p) denote
the ID of this point. Clearly, I(p) can be computed in constant time.
Given a (d − 1)-tuple I = (i1 , . . . , id−1 ) its pillar is the set of grid cells that have (i1 , . . . , id−1 ) as the first d − 1 coordinates of
their ID. Scan the points of P, and for each pillar record the highest and lowest point encountered. Here highest/lowest refer to their
value in the dth direction.
We claim that the resulting set S is the required coreset. Indeed, consider a direction v ∈ S(d−1) , and a point p ∈ P. Let q, q0 be
the highest and lowest points in P which are inside the pillar of p, and are thus in S. Let Bq , Bq0 be the two grid cells containing q
and q0 . Clearly, the projection of CH(Bq ∪ Bq0 ) into the direction of v contains the projection of p into the direction of v. Thus, for
a vertex u of Bq it is sufficient to show that
ω(v, {u, q}) ≤ ω(v, B/M) ≤ (ε/2)ω(v, P),
since this implies that ω(v, S) ≥ ω(v, P) − 2ω(v, B/M) ≥ (1 − ε)ω(v, P). Indeed,
ω(v, P) ε
ω(v, B/M) ≤ ω(v, B)/M ≤ ω(v, P)/(cd M) ≤ ≤ ω(v, P).
4/ε 4
As for the preprocessing time, it requires a somewhat careful implementation. We construct a hash-table (of size O(n)) and
store for every pillar the top and bottom points encountered. When handling a point this hash table can be updated in constant time.
Once the coreset was computed, the coreset can be extracted from the hash-table in linear time.
Theorem 19.1.3 Let P be a set of n points in IRd , and let 0 < ε < 1 be a parameter.
One can compute
a ε-coreset S for directional
width of P. The size of the coreset is O(1/εd−1 ), and construction time is O n + min(n, 1/εd−1 ) .
Proof: Compute a good bounding box of P using Lemma 18.3.1. Then apply Lemma 19.1.2 to P.
Proof: Using the algorithm of Lemma 18.3.1 compute, in O(n) time, a bounding B of P, such that there exists a vector w ~ such
~ + d(4d1+1) B ⊆ CH(P) ⊆ B.
that w
Let T 1 be the linear transformation that translates the center of T to the origin. Clearly, S ⊆ P is a ε-coreset for direction width
of P, if and only if S1 = T 1 (S) is an ε-coreset for directional width of P1 = T 1 (P). Next, let T 2 be a rotation that rotates B1 = T 1 (B)
such that its sides are parallel to the axises. Again, S2 = T 2 (S1 ) is a ε-coreset for P2 = T 2 (P1 ) if and only if S1 is a coreset for P1 .
Finally, let T 3 be a scaling of the axises such that B2 = T 2 (B1 ) is mapped to the hypercube H.
Note, that T 3 is just a diagonal matrix. As such, for any p ∈ IRd , and a vector v ∈ S(d−1) we have
T D E
hv, T 3 pi = vT T 3 p = T 3T v p == T 3T v, p = hT 3 v, pi .
= ω(T 3 v, P2 ).
Similarly, ω(v, S3 ) = ω(T 3 v, S2 ).
By definition, S2 is a ε-coreset for P2 iff for any non zero v ∈ IRd , we have ω(v, P2 ) ≥ (1 − ε)ω(v, S2 ). Since T 3 non
singular, this implies that for any non-zero v, we have ω(T 3 v, S2 ) ≥ (1 − ε)ω(T 3 v, P2 ), which holds iff ω(v, S3 ) = ω(v, T 3 (S2 )) ≥
(1 − ε)ω(v, T 3 (P2 )) = (1 − ε)ω(v, P3 ). Thus S3 is a ε-coreset for P3 . Clearly, the other direction holds by a similar argumentation.
Set T = T 3 T 2 T 1 , and observe that, by the above argumentation, S is a ε-coreset for P if and only if T (S) is a ε-coreset for
T (P). However, note that T (B) = H, and T (~ w + d(4d1+1) B) ⊆ CH(T (P)). Namely, there exists a vector w~0 such that w~0 + d(4d1+1) H ⊆
CH(T (P)) ⊆ H. Namely, the point set T (P) is α = d(4d1+1) -fat.
166
p
(L − r)2 + r2 p
x1
{
r
L−r (L, 0)
Proof: For any vector v, we have ω(v, A) ≥ (1 − δ)ω(v, B) ≥ (1 − δ)(1 − ε)ω(v, C) ≥ (1 − δ − ε)ω(v, C).
Thus, given a point-set P, we can first extract from it a ε/2-coreset of size O(1/εd−1 ), using Lemma 19.1.2. Let Q denote the
resulting set. We will compute a ε/2-coreset for Q, which would be by the above observation a ε-coreset for directional width of P.
We need the following technical lemma.
Lemma 19.2.3 Let b be a ball of radius r centered at (L, 0, . . . , 0) ∈ IRd , where L ≥ 2r. Let p be an arbitrary point in b, and let b0
be the largest ball centered at p and touching the origin. Then, we have that for µ(p) = min 0 x1 we have µ(p) ≥ −r2 /L.
(x1 ,x2 ,...,xd )∈b
Proof: Clearly, if we move p in parallel to the x1 -axis by decreasing the value of x1 , we are decreasing the value of µ(p). Thus,
in the worst case x1 (p) = L − r. Similarly, the farther away p is from the x1 -axis the smaller µ(p)
p is. Thus, by symmetry, the worst
case is when p = (L − r, r, 0, . . . , 0). See Figure 19.1. The distance between p and the origin is (L − r)2 + r2 , and
p (L − r)2 − (L − r)2 − r2 r2 r2
µ(p) = (L − r) − (L − r)2 + r2 = p ≥− ≥− ,
(L − r) + (L − r)2 + r2 2(L − r) L
since L ≥ 2r.
Lemma 19.2.4 Let Q be a set of m points. Then one can compute a ε/2-coreset for Q of size O(1/ε(d−1)/2 ), in time O(m/ε(d−1)/2 ).
Proof: Note, that byLemma 19.2.1, we can assume that Q is α-fat for some constant α, and v + α[−1, 1]d ⊆ CH(Q) ⊆ [−1, 1]d ,
where v ∈ IRd . In particular, for any√direction u ∈ S(d−1) , we have ω(u, Q) ≥√2α.
Let S be the sphere of radius d + 1 centered at the origin. Set δ = εα/4 ≤ 1/4. One can construct a set I of O(1/δd−1 ) =
O(1/ε(d−1)/2 ) points on the sphere S so that for any point x on S, there exists a point y ∈ I such that kx − yk ≤ δ. We process Q into
a data structure that can answer ε-approximate nearest-neighbor queries. For a query point q, let φ(q) be the point of Q returned
by this data structure. For each point y ∈ I, we compute φ(y) using this data structure. We return the set S = {φ(y) | y ∈ I}; see
Figure 19.2 (ii).
We now show that S is an (ε/2)-coreset of Q. For simplicity, we prove the claim under the assumption that φ(y) is the exact
nearest-neighbor of y in Q. Fix a direction u ∈ S(d−1) . Let σ ∈ Q be the point that maximizes hu, pi over all p ∈ Q. Suppose the ray
emanating from σ in direction u hits S at a point x. We know that there exists a point y ∈ I such that kx − yk ≤ δ. If φ(y) = σ, then
σ ∈ S and
maxhu, pi − maxhu, qi = 0.
p∈Q q∈S
Now suppose φ(y) , σ. Rotate and translate space, such that σ is at the origin, and u is the positive x1 axis. Setting L = koxk and
r = δ, we have that hu, yi ≥ −r2 /L ≥ −δ2 /1 = −δ2 = −εα/4, by Lemma 19.2.3. We conclude that ω(u, S) ≥ ω(u, Q) − 2(εα/4) =
ω(u, Q) − εα/2. On the other hand, since ω(u, Q) ≥ 2α, it follows tat ω(u, S) ≥ (1 − ε/2)ω(u, Q).
As for the running time, we just perform the scan in the most naive way to find φ(y) for each y ∈ I. Thus, the running time is
as stated.
Theorem 19.2.5 Let P be a set of n points in IRd . One can compute a ε-coreset for directional width of P in O(n + 1/ε3(d−1)/2 ) time.
The coreset size is O(ε(d−1)/2 ).
167
B
(y)
y y
CH(P )
w
z x
u h
(i) (ii)
Figure 19.2: (i) An improved algorithm. (ii) Correctness of the improved algorithm.
Proof: We use the algorithm of Lemma 19.2.1 and Lemma 19.1.2 on the resulting set. This computes a ε/2-coreset Q of P of size
O(1/εd−1 ). Next, we apply Lemma 19.2.4 and compute a ε/2-coreset S of Q. This is a ε-coreset of P.
19.3 Exercises
Exercise 19.3.1 [5 Points]
Prove that in the worst case, a ε-coreset for directional width has to be of size Ω(ε−(d−1)/2 ).
168
Chapter 20
Once I sat on the steps by a gate of David’s Tower, I placed my two heavy baskets at my side. A group of tourists
was standing around their guide and I became their target marker. “You see that man with the baskets? Just right of
his head there’s an arch from the Roman period. Just right of his head.”
“But he’s moving, he’s moving!”
I said to myself: redemption will come only if their guide tells them, “You see that arch from the Roman period? It’s
not important: but next to it, left and down a bit, there sits a man who’s bought fruit and vegetables for his family.”
– –Yehuda Amichai, Tourists
20.1 Preliminaries
Definition 20.1.1 Given a set of hyperplanes H in IRd , the minimization diagram of H, known as the lower envelope of H, is the
function LH : IRd−1 → IR, where we have L(x) = minh∈H h(x), for x ∈ IRd−1 .
Similarly, the upper envelope of H is the function U(x) = maxh∈H h(x), for x ∈ IRd−1 .
The extent of H and x ∈ IRd−1 is the vertical distance between the upper and lower envelope at x; namely, EH (x) = U(x)−L(x).
169
upper envelope
outer
extent
lower envelope
(ii) t
(i) t
Figure 20.1: (i) The extent of the moving points, is no more than the vertical segment connecting the lower
envelope to the upper envelope. The black dots mark where the movement description of I(t) changes. (ii)
The approximate extent.
We want to maintain a vertical interval Iε+ (t) so that I(t) ⊆ Iε+ (t) and Iε+ (t) ≤ (1 + ε) |I(t)| for all t, so that the endpoints of Iε+ (t)
follow piecewise-linear trajectories, and so that the number
of combinatorial changes in I ε (t) is small. Alternatively, we want to
−
maintain a vertical interval Iε (t) ⊆ I(t) such that Iε (t) ≥ (1 − ε) |I(t)|. Clearly, having one approximation would imply the other by
−
appropriate rescaling.
Geometrically, this has the following interpretation: We want to simplify the upper and lower envelopes of A(L) by convex
and concave polygonal chains, respectively, so that the simplified upper (resp. lower) envelope lies above (resp. below) the original
upper (resp. lower) envelope and so that for any t, the vertical segment connecting the simplified envelopes is contained in (1+ε)I(t).
See Figure 20.1 (ii).
In the following, we will use duality, see Lemma 23.2.1 for the required properties we will need.
Definition 20.2.1 For a set of hyperplanes H, a subset S ⊂ H is a ε-coreset of H for the extent measure, if for any x ∈ IRd−1 we
have ES ≥ (1 − ε)EH .
Similarly, for a point-set P ⊆ IRd , a set S ⊆ P is a ε-coreset for vertical extent of P, if, for any direction v ∈ S(d−1) , we have that
µv (S) ≥ (1 − ε)µv (P), where µv (P) is the vertical distance between the two supporting hyperplanes of P which are perpendicular to
v.
Thus, to compute a coreset for a set of hyperplanes, it is by duality and Lemma 23.2.1 enough to find a coreset for the vertical
extent of a point-set.
Lemma 20.2.2 The set S is a ε-coreset of the point set P ⊆ IRd for vertical extent if and only if S is a ε-coreset for directional
width.
Proof: Consider any direction v ∈ S(d−1) , and let α be its (smaller) angle with with the xd axis. Clearly, ω(v, S) = µv (S) cos α
and ω(v, P) = µv (PntS et) cos α. Thus, if ω(v, S) ≥ (1 − ε)ω(v, P) then µv (S) ≥ (1 − ε)µv (P), and vice versa.
Theorem 20.2.3 Let H be a set of n hyperplanes in IRd . One can compute a ε-coreset of H of, size O(1/εd−1 ), in O(n +
min(n, 1/εd−1 )) time. Alternatively, one can compute a ε-coreset of size O(1/ε(d−1)/2 ), in time O(n + 1/ε3(d−1)/2 ).
Proof: By Lemma 20.2.2, the coreset computation is equivalent to computing coreset for directional width. However, this can
be done in the stated bounds, by Theorem 19.1.3 and Theorem 19.2.5.
Going back to our motivation, we have the following result:
Lemma 20.2.4 Let P(t) be a set of n points with linear motion in IRd . We can compute an axis parallel moving bounding box
√
b(t) for P(t) that changes O(d/ ε) times (in other times, the bounding box moves with linear motion). The time to compute this
bounding box is O(d(n + 1/ε3/2 )).
Furthermore, we have that Box(P(t)) ⊆ b(t) ⊆ (1 + ε)Box(P(t)), where Box(t) is the minimum axis parallel bounding box of P.
Proof: We compute the solution for each dimension separately. In each dimension, we compute a coreset of the resulting set
of lines in two dimensions, and compute the upper and lower envelope of the coreset. Finally, we expand the upper and lower
envelopes appropriately so that the include the original upper and lower envelopes. The bounds on the running time follows from
Theorem 20.2.3.
170
20.3 Coresets
At this point, our discussion exposes a very powerful technique for approximate geometric algorithms: (i) extract small subset
that represents that data well (i.e., coreset), and (ii) run some other algorithm on the coreset. To this end, we need a more unified
definition of coresets.
d
Definition 20.3.1 (Coresets) Given a set P of points (or geometric objects) in IRd , and an objective function f : 2IR → IR (say,
f (P) is the width of P), a ε-coreset is a subset S of the points of P such that
f (S) ≥ (1 − ε) f (P).
We will state this fact, by saying that S is a ε-coreset of P for f (·).
If the function f (·) is parameterized, namely f (Q, v), then S ⊆ P is a coreset if
∀v f (S, v) ≥ (1 − ε) f (P, v).
As a concrete example, for v a unit vector, consider the function ω(v, P) which is the directional width of P; namely, it is the
length of the projection of CH(P) into the direction of v.
Coresets are of interest when they can be computed quickly, and have small size, hopefully of size independent of n, the size
of the input set P. Interestingly, our current techniques are almost sufficient to show the existence of coresets for a large family of
problems.
Theorem 20.4.1 Given a family of d-variate polynomials F = { f1 , . . . , fn }, and parameter ε, one can compute, in O(n + 1/ε s ) time,
a subset F 0 ⊆ F of O(1/ε s ) polynomials, such that F 0 is a ε-coreset of F for the extent measure. Here s is the number of different
monomials present in the polynomials of F .
Alternatively, one can compute a ε-coreset, of size O(1/ε s/2 ), in tiem O(n + 1/ε3s/2 ).
171
Proof: Let F 2 denote the family { f1 , . . . , fn }.n Using the algorithm
o of Theorem 20.4.1, we compute a δ0 -coreset G2 ⊆ F 2 of F 2 ,
0 2 1/2 2
where δ = ε /64. Let G ⊆ F denote the family ( fi ) | fi ∈ G .
Consider any point x ∈ IRk . We have that EG2 (x) ≥ (1 − δ0 )EF 2 (x), and let a = LF 2 (x), A = LG2 (x), B = UG2 (x), and
b = UF 2 (x). Clearly, we have 0 ≤ a ≤ A ≤ B ≤ b and B − A ≥ (1 − δ0 )(b − a). Since (1 + 2δ0 )(1 − δ0 ) ≥ 0, we have that
(1 + 2δ0 )(B − A) ≥ b − a. √ √ √ √ √ √
√ By√ Lemma 20.5.2 √ we have that Then, A − a ≤ (ε/2)U , and b − B ≤ (ε/2)U, where U = B − A. Namely,
√ below,
B − A ≥ (1 − ε)( b − a). Namely, G is a ε-coreset for the extent of F .
The bounds on the size of G and the running time are easily verified.
Lemma
√ √ 20.5.2 Let 0 ≤ a √
≤ A ≤√B ≤ b, and 0 < ε ≤ 1 be given
√ parameters,
√ so that b − a ≤ (1 + δ)(B − A), where δ = ε2 /16. Then,
A − a ≤ (ε/2)U , and b − B ≤ (ε/2)U, where U = B − A.
Proof: Clearly,
√ √ √ √ √ √ √ √ √ √ √
A+ B ≤ a + A − a + b ≤ a + δb + b ≤ (1 + δ)( a + b).
√ √
A+√ B √ √
Namely, 1+ δ
≤ a + b. On the other hand,
√ √ b−a (1 + δ)(B − A) √ B−A
b− a = √ √ ≤ √ √ ≤ (1 + δ)(1 + δ) √ √
b+ a b+ a B+ A
√ √ √ √
= ‘(1 + ε2 /16)(1 + ε/4)( B − A) ≤ (1 + ε/2)( B − A).
20.6 Applications
20.6.1 Minimum Width Annulus
Let P = {p1 , . . . , pq
n } be a set of n points in the plane. Let fi (q) denote the distance of the ith point from the point q. It is easy to
2 2
verify that fi (q) = xq − x pi + yq − y pi . Let F = { f1 , . . . , fn }. It is easy to verify that for a center point x ∈ IR2 , the width of the
minimum width annulus containing P which is centered at x has width EF (x). Thus, we would like to compute a ε-coreset for F .
2 2
Consider the set of functions F 2 . Clearly, fi2 (x, y) = x − x pi + y − y pi = x2 − 2x pi x + x2pi + y2 − 2y pi y + y2pi . Clearly,
all the functions of F 2 have this (additive) common
factor of x2 + y2 . Since we only care about the vertical extent, we have
H = −2x p x + x2 − 2y p y + y2 i = 1, . . . , n has the same extent as F 2 ; formally, for any x ∈ IR2 , we have EF 2 (x) = EH (x).
i pi i pi
Now, H is just a family of hyperplanes in IR3 , and it has a ε2 /64-coreset SH for the extent of size 1/ε which can be computed in
O(n + 1/ε3 ) time. This corresponds to a ε2 /64-coreset SF 2 of F 2 . By Theorem 20.5.1, this corresponds to a ε-coreset SF of F .
Finally, this corresponds to coreset S ⊆ P of size O(1/ε), such that the minimum width annulus of S, if we expand it by (1 + 2ε),
it contains all the points of P. Thus, we can just find the minimum width annulus of S. This can be done in O(1/ε2 ) time using an
exact algorithm. Putting everything together, we get:
Theorem 20.6.1 Let P be a set of n points in the plane, and let 0 ≤ ε ≤ 1 be a parameter. One can compute a (1 + ε)-approximate
minimum width annulus to P in O(n + 1/ε3 ) time.
20.7 Exercises
20.8 Bibliographical notes
Linearization was widely used in fields such as machine learning [CS00] and computational geometry [AM94].
There is a general technique for finding the best possible linearization (i.e., a mapping η with the target dimension as small as
possible), see [AM94] for details.
172
Chapter 21
Drug misuse is not a disease, it is a decision, like the decision to step out in front of a moving car. You would call
that not a disease but an error in judgment. When a bunch of people begin to do it, it is a social error, a life-style. In
this particular life-style the motto is “be happy now because tomorrow you are dying,” but the dying begins almost
at once, and the happiness is a memory. ... If there was any “sin,” it was that these people wanted to keep on
having a good time forever, and were punished for that, but, as I say, I feel that, if so, the punishment was far too
great, and I prefer to think of it only in a Greek or morally neutral way, as mere science, as deterministic impartial
cause-and-effect.
– A Scanner Darkly, Philip K. Dick
X
d X
d
fi (u, t) = hpi (t), ui = hai + bi t, ui = ai j u j + bi j · (tu j ).
j=1 j=1
Set F = { f1 , . . . , fn }. Then
ω(u, P(t)) = max hpi (t), ui − min hpi (t), ui = max fi (u, t) − min fi (u, t) = EF (u, t).
i i i i
Since F is a family of (d + 1)-variate polynomials, which admits a linearization of dimension 2d (there are 2d monomials), using
Theorem 20.4.1, we conclude the following.
Theorem 21.1.1 Given a set P of n points in IRd , each moving linearly, and a parameter ε > 0, we can compute an ε-coreset of P
for directional width of size O(1/ε2d ), in O(n + 1/ε2d ) time, or an ε-coreset of size O(1/εd ) in O(n + 1/ε3(d) ) time.
173
If the degree of motion of P is r > 1, we can write the d-variate polynomial fi (u, t) as:
X
r Xr D E
fi (u, t) = hpi (t), ui = ai j t j , u = ai j t j , u
j=0 j=0
where ai j ∈ IRd . A straightforward extension of the above argument shows that fi ’s admit a linearization of dimension (r + 1)d.
Using Theorem 20.4.1, we obtain the following.
Theorem 21.1.2 Given a set P of n moving points in IRd whose motion has degree r > 1 and a parameter ε > 0, we can compute
an ε-coreset for directional width of P of size O(1/ε(r+1)d ) in O(n + 1/ε(r+1)d ) time, or of size O(1/ε(r+1)d /2) in O(n + 1/ε3(r+1)d/2 )
time.
2
Theorem 21.1.3 Given a set P of n points in IRd and a parameter ε > 0, we can compute in O(n + 1/εO(d ) ) time a subset S ⊆ P of
2
size O(1/εO(d ) ) so that for any line ` in IRd , we have w(`, S) ≥ (1 − ε)w(`, P).
Note, that Theorem 21.1.3 does not compute the optimal cylinder, it just computes a small coreset for this problem. Clearly,
4
we can now run any brute force algorithm on this coreset. This would result in running time O(n + 1/εO(d ) ), which would output a
cylinder which if expanded by factor 1 + ε, will cover all the points of P. In fact, the running time can be further improved.
21.2 Exercises
21.3 Bibliographical notes
Section 21.1.1 is from Agarwal et al. [AHV04], and the results can be (very slightly) improved by treating the direction as (d − 1)-
dimensional entity, see [AHV04] for details.
174
Figure 21.1: Parameterization of a line ` in IR3 and its distance from a point p; the small hollow circle on `
is the point closest to p.
Theorem 21.1.3 is also from [AHV04]. The improved running time to compute the approximate cylinder mentioned in the text,
follows by a more involved algorithm, which together with the construction of the coreset, also compute a compact representation
of the extent of the coreset. The technical details are not trivial, and we skip them. In particular, the resulting running time for
2
computing the approximate cylindrical shell is O(n + 1/εO(d ) ). See [AHV04] for more details.
175
176
Chapter 22
“And so ended Svejk’s Budejovice anabasis. It is certain that if Svejk had been granted liberty of movement he
would have got to Budejovice on his own. However much the authorities may boast that it was they who brought
Svejk to his place of duty, this is nothing but a mistake. With Svejk energy and irresistible desire to fight, the
authorities action was like throwing a spanner into the works.”
– – The good soldier Svejk, Jaroslav Hasek
Definition 22.1.1 (Shell sets) Given a set P of points (or geometric objects) in IRd , and F be a family of shapes in IRd . Let
f : F → IR be a target optimization function, and assume that there is a natural expansion operation defined over F. Namely, given
a set r ∈ F, one can compute a set (1 + ε)r which is the expansion of r by a factor of 1 + ε. In particular, we would require that
f ((1 + ε)r) ≤ (1 + ε) f (r).
Let f opt (P) = minr∈F,P⊆r f (r) be the shape in F that bests fits P.
Furthermore, assume that f opt (·) is a monotone function, that is for A ⊆ B ⊆ P we have f opt (A) ≤ f opt (B).
A subset S ⊆ P is a ε-shell set for P, if SlowAlg on a set B that contains S, if the range r returned by SlowAlg(S) covers
S, (1 + ε)r covers P, and f (r) ≤ (1 + ε) f opt (S). Namely, the range (1 + ε)r is an (1 + ε)-approximation to the optimal range of F
covering P.
A shell set S is a monotone ε-shell set if for any subset B containing S, if we apply SlowAlg(B) and get the range r, then P
is contained inside (1 + ε)r and r covers B.
Note, that ε-shell sets are considerably more restricted and weaker than coresets. Of course, a ε-coreset is automatically a
(monotone) ε-shell set. Note also, that if a problem has a monotone shell set, then to approximate it efficiently, all we need to do is
to find some set, hopefully small, that contains the shell set.
177
ComputeShellSet(P)
We initialize all the points of P to have weight 1, and we repeatedly do the following:
• Pick a random sample R from P of size r = O((dimVC /δ) log(dimVC /δ)), where δ =
1/(4kopt ). With constant probability R is a δ-net for P by Theorem 5.3.4.
• Compute, using SlowAlg(R) the range r in F, such that (1 + ε)r covers R and realizes
(maybe approximately) f opt (R).
• Compute the set S of all the points of P outside (1 + ε)r. If the total weight of those points
exceeds δw(P) then the random sample is bad, and return to the first step.
• If the set S is empty then return R as the required shell set, and r as the approximation.
• Otherwise, double the weight of the points of S .
When done, return r and the set R.
Figure 22.1: The algorithm for approximating optimal cover and computing a small shell set.
Finally, assume that we know that a small monotone shell set of size kopt exists for P, but unfortunately we have no way of
computing it explicitly (because, for example, we only have a constructive proof of the existence of such a shell set).
A natural question is how to compute this small shell set quickly, or alternatively compute an approximate shell set which is
not much bigger. Clearly, once we have such a small shell set, we can approximate the optimal cover for P in F.
Example. We start with a toy example, a more interesting example is given below. Let F be the set of all balls in IRd , and let
f (r) be the radius of the ball r ∈ F. It is known that there is a ε-shell set for the minimum radius ball of size O(1/ε) (we will prove
this fact later in the course). The expansion here is the natural enlargement of a ball radius.
Random sampling from a weighted set. Another technicality is that the weights might be quite large. To overcome
this, we will store the weight of an element by storing an index i, such that the weight of the element is 2i , We still need to do m
independent draws from this weighted set. The easiest way to do that, is to compute the element e in of P in maximum weight,
and observing that all elements of weight ≤ we /n10 have weight which is so tiny, so that it can be ignored, where w p is the weight
of e. Thus, normalize all the weights of by dividing them by 2blg we /n c , and remove all elements with weights smaller than 1. For
10
ω(p) denote its normalized weight. Clearly, all the normalized weights are integers in the range 1, . . . , 2n10 . Thus,
a point p, let b
we now have to pick points for a set with (small) integer weights. Place the elements in an array, and compute the prefix sum
P
array of their weights. That is ak = ki=1 b
ω(pi ), for i = 1, . . . , n. Next, pick a random number γ uniformly in the range [0, an ], and
using a binary search, find the j, such that a j−1 ≤ γ < a j . This picks the points p j to be in the random sample. This requires O(n)
preprocessing, but a single random sample can now be done in O(log n) time. We need to perform r independent samples. Thus,
this takes O(n + r log n).
22.3.1 Correctness
Lemma 22.3.1 The algorithm described above computes a ε-shell set for P of size O(r) = O(kopt dimVC log (kopt dimVC )). The
algorithm performs O(4kopt ln n) iterations.
Proof: We only need to prove that the algorithm terminates in the claimed number of iterations. Observe, that with constant
probability (say ≥ 0.9), the sample Ri , in the ith iteration, is an δ-net for Pi−1 (the weighted version of P in the end of the (i −
178
1)th iteration),
in relation to the ranges of F. Observe, that this also implies that Ri is a δ-net for the complement family F =
d
IR \ r r ∈ F , with constant probability (since (IRd , F) and (IRd , F) have the same VC dimension).
If Ri is such a δ-net, then we know that range r we compute completely covers the set Ri , and as such, for any range r0 ∈ F that
avoids Ri we have w(r0 ) ≤ δw(Pi ). In particular, this implies that ω(S i ) ≤ δw(Pi−1 ). If not, than Ri is not a δ-net, and we resample.
The probability for that is ≤ 0.1. As such, we expect to repeat this O(1) times in each iteration, till we have w(S i ) ≤ δw(Pi−1 ).
Thus, in each iteration, the algorithm doubles the weight of at most a δ-fraction of the total point set. Thus w(Pi ) ≤ (1 +
δ)w(Pi−1 ) = n(1 + δ)i .
On the other hand, consider the smallest shell S of P, which is of size kopt . If all the elements of S are in Ri , then the algorithm
would have terminated, since S is a monotone shell set1. Thus, if we continue to the next iteration, it must be that |S ∩ S i | ≥ 1. In
particular, we are doubling the weight of at least one element of the shell set. We conclude that the weight of Pi in the ith iteration,
is at least
kopt 2i/kopt ,
since in every iteration at least one element of S gets its weight redoubled. Thus, we have
! !
i i
exp ≤ 2i/kopt ≤ kopt 2i/kopt ≤ (1 + δ)i n ≤ n · exp(δi) = n · exp .
2kopt 4kopt
Namely, exp 4kiopt ≤ n. Implying that i ≤ 4kopt ln n. Namely, after 4kopt ln n iterations the algorithm terminates, and thus returns the
required shell set and approximation.
Theorem 22.3.2 Under the settings of Section 22.2,one can compute a monotone ε-shell set for P of size O(kopt dimVC log(kopt dimVC )).
The running time of the resulting algorithm is O (n + T (kopt dimVC log(kopt dimVC )))kopt ln n , with high probability, for kopt ≤
n/ log3 n. Furthermore, one can compute an ε-approximation to f opt (P) in the same time bounds.
Proof: The algorithm is described above. The bounds on the running time follows from the bounds on the number of iterations
from Lemma 22.3.1. The only problem we need to address, is that the resampling would repeatedly fail, and the algorithm would
spend exuberant amount of time on resampling. However, the probability of failure in sampling is ≤ 0.1. Furthermore, we need at
most 4kopt log n good samples before the algorithm succeeds. It is now straightforward to show using Chernoff inequality, that with
high probability, we will perform at most 8kopt log n samplings before achieving the required number of good samples.
The natural algorithm for this problem is the greedy algorithm that repeatedly pick the set in the family F that covers the largest
number of uncovered elements in S . It is not hard to show that this provides a O(|S |) approximation. In fact, it is known that set
covering can be better approximated unless P = NP.
Assume, however, that we know that the VC dimension of the set system (S , F) has VC dimension dimVC . In fact, we need a
stronger fact, that the dual family
S = F, U(s, F) s ∈ S ,
is of low VC dimension dimVC , where U(s, F) = X s ∈ X, X ∈ F .
It turns out that the algorithm of Figure 22.1 also works in this setting. Indeed, we set the weight of the sets to 1, we pick
a random sample of sets. IF they cover the universe S , we are done. Otherwise, there must be a point p which is not covered.
Arguing as above, we know that the random sample is a δ-net of (the weighted) S, and as such all the sets containing p have total
weight ≤ δ(S). As such, double the weight of all the sets covering p, and repeat. Arguing as above, one can show that the algorithm
terminates after O(kopt log m) iterations, where m is the number of sets, where kopt is the number of sets in the optimal cover of S .
Furthermore, the size of the cover generated is O(kopt dimVC log(kopt dimVC )).
179
Theorem 22.3.3 Let (S , F) be a range space, such that the dual range space S has VC dimension dimVC . Then, one can compute a
set covering for S using sets of F using O(kopt dimVC log(kopt dimVC )) sets. This requires O(kopt log n) iterations, and takes polynomial
time.
Note, that we did not provide in Theorem 22.3.3 exact running time bounds. Usually in geometric settings, one can get
improved running time using the underlying geometry. Interestingly, the property that the dual system has low VC dimension
“buys” one a lot, as it implies that one can do O(log kopt ) approximation, instead of O(log n) in the general case.
Definition 22.5.1 Let P be a point set in IRd , 1/2 > ε > 0 a parameter.
For a cluster c, let c(δ) denote the cluster resulting form expanding c by δ. Thus, if c is a ball of radius r, then c(δ) is a ball of
radius r + δ. For a set C of clusters, let
C(δ) = c(δ) c ∈ C ,
be the additive expansion operator; that is, C(δ) is a set of clusters resulting form expanding each cluster of C by δ.
Similarly,
(1 + ε)C = (1 + ε)c c ∈ C ,
is the multiplicative expansion operator, where (1 + ε)c is the cluster resulting from expanding c by a factor of (1 + ε). Namely, if
C is a set of balls, then (1 + ε)C is a set of balls, where a ball c ∈ C, corresponds to a ball radius (1 + ε) radius(c) in (1 + ε)C.
A set S ⊆ P is an (additive) ε-coreset of P, in relation to a price function radius, if for any clustering C of S, we have that P is
covered by C(ε radius(C)), where radius(C) = maxc∈C radius(c). Namely, we expand every cluster in the clustering by an ε-fraction
of the size of the largest cluster in the clustering. Thus, if C is a set of k balls, then C(ε f (C)) is just the set of balls resulting from
expanding each ball by εr, where r is the radius of the largest ball.
A set S ⊆ P is a multiplicative ε-coreset of P, if for any clustering C of S, we have that P is covered by (1 + ε)C.
Lemma 22.5.2 Let P be a set of n points in IRd , and ε > 0 a parameter. There exists an additive ε-coreset for the k-center problem,
and this coreset has O(k/εd ) points.
Proof: Let C denote the optimal clustering of P. Cover each ball of C by a grid of side length εropt /d, where ropt is the radius
of the optimal k-center clustering of P. From each such grid cell, pick one points of P. Clearly, the resulting point set S is of size
O(k/εd ) and it is an additive coreset of P.
The following is a minor extension of an argument used in [APV02].
d
Lemma 22.5.3 Let P be a set of n points
in IR , and ε > 0 a parameter. There exists a multiplicative ε-coreset for the k-center
problem, and this coreset has O k!/εdk points.
Proof: For k = 1, the additive coreset of P is also a multiplicative coreset, and it is of size O(1/εd ).
As in the proof of Lemma 22.5.2, we cover the point set by a grid of radius εropt /(5d), let SQ the set of cells (i.e., cubes) of
this grid which contains points of P. Clearly, |SQ| = O(k/εd ).
Let S be the additive ε-coreset of P. Let C be any k-center clustering of S, and let ∆ be any cell of SQ.
If ∆ intersects all the k balls of C, then one of them must be of radius at least (1 − ε/2)rd(P, k). Let c be this ball. Clearly, when
we expand c by a factor of (1 + ε) it would completely cover ∆, and as such it would also cover all the points of ∆ ∩ P.
Thus, we can assume that ∆ intersects at most k − 1 balls of C. As such, we can inductively compute an ε-multiplicative coreset
S
of P ∩ ∆, for k − 1 balls. Let Q∆ be this set, and let Q = S ∪ ∆∈SQ Q∆ .
Note that |Q| = T (k, ε) = O(k/εd )T (k − 1, ε) + O(k/εd ) = O k!/εdk . The set Q is the required multiplicative coreset by the
above argumentation.
180
22.6 Union of Cylinders
Let assume we want to cover P by k cylinders of minimum maximum radius (i.e., fit the points to k lines). Formally, consider G
to be the set of all cylinders in IRd , and let F = c1 ∪ c2 ∪ . . . ∪ ck c1 , . . . , ck ∈ F be the set, which its members are union of k
cylinders. For C ∈ F, let f (C) = maxc∈C radius(c). Let f opt (P) = minC∈F,P⊆C f (C).
One can compute the optimal cover of P by k cylinders in O(n(2d−1)k+1 ) time, see below for details. Furthermore, (IRd , F) has
VC dimension dimVC = O(dk log(dk)). Finally, one can show that this set of cylinders has ε-coreset of small size ???. Thus, we
would like to compute a small ε-coreset, and compute an approximation quickly.
Theorem 22.6.1 Let P be a set of n points in IRd . There exists a (additive) ε-coreset for P of size O(k/εd−1+k ) for covering the
points by k-cylinders of minimum radius.
181
22.7 Bibliographical notes
Section 22.3.2 is due to Clarkson [Cla93]. This technique was used to approximate terrains [AD97], and covering polytopes
[Cla93].
The observation that this argument can be used to speedup approximation algorithms is due to Agarwal et al. [APV02]. The
discussion of shell sets is implicit in the work of Bădoiu et al. [BHI02].
Chapter 23
Duality
I don’t know why it should be, I am sure; but the sight of another man asleep in bed when I am up, maddens me. It
seems to me so shocking to see the precious hours of a man’s life - the priceless moments that will never come back
to him again - being wasted in mere brutish sleep.
– Jerome K. Jerome, Three Men in a Boat
Duality is a transformation that maps lines and points into points and lines, respectively, while preserving some properties in
the process. Despite its relative simplicity, it is a powerful tool that can dualize what seems like “hard” problems into easy dual
problems.
p = (a, b)  ⇒  p⋆ : y = ax − b
ℓ : y = cx + d  ⇒  ℓ⋆ = (c, −d).
We will consider a line ℓ ≡ y = cx + d to be a linear function in one dimension, and let ℓ(x) = cx + d.
A point p = (a, b) lies above a line ℓ ≡ y = cx + d if p lies vertically above ℓ. Formally, we have that b > ℓ(a) = ca + d. We
will denote this fact by p ≻ ℓ. Similarly, the point p lies below ℓ if b < ℓ(a) = ca + d, denoted by p ≺ ℓ.
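To make the transform concrete, the following Python sketch (mine, not from the text; the function names are illustrative) encodes the two maps and checks on random inputs that the above/below relation is preserved with the roles of the point and the line exchanged: p lies above ℓ exactly when ℓ⋆ lies above p⋆, since both conditions read b > ca + d.

```python
import random

def dual_of_point(p):
    """Point p = (a, b) maps to the line y = a*x - b, returned as (slope, intercept)."""
    a, b = p
    return (a, -b)

def dual_of_line(line):
    """Line y = c*x + d maps to the point (c, -d)."""
    c, d = line
    return (c, -d)

def above(p, line):
    """True if the point p lies (strictly) vertically above the line y = c*x + d."""
    a, b = p
    c, d = line
    return b > c * a + d

# Sanity check: p lies above l  if and only if  l's dual point lies above p's dual line.
for _ in range(1000):
    p = (random.uniform(-5, 5), random.uniform(-5, 5))
    l = (random.uniform(-5, 5), random.uniform(-5, 5))
    assert above(p, l) == above(dual_of_line(l), dual_of_point(p))
```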
A line ℓ supports a convex set S ⊆ IR^2 if it intersects S but the interior of S lies completely on one side of ℓ.
The missing lines. Consider the vertical line ℓ ≡ x = 0. Clearly, ℓ does not have a dual point (specifically, its hypothetical
dual point has an x coordinate with infinite value). In particular, our duality cannot handle vertical lines. To visualize the problem,
consider a sequence of non-vertical lines ℓ_i that converges to a vertical line ℓ. The sequence of dual points ℓ_i⋆ is a sequence of points
that diverges to infinity.
23.1.1 Examples
23.1.1.1 Segments and Wedges
Consider a segment s = pq that lies on a line ℓ. Observe that the dual of a point r ∈ ℓ is a line r⋆ that passes through the point ℓ⋆. In fact, the two lines p⋆ and q⋆ define two double wedges. Let W be the double wedge that does not contain the vertical line that passes through ℓ⋆.
[Figure: a segment s = pq on a line ℓ (primal) and the corresponding double wedge W bounded by p⋆ and q⋆ (dual).]
Consider now the point r as it moves along s. When it is equal to p, its dual line r⋆ is the line p⋆. Now, as r moves along the segment s, the dual line r⋆ rotates around ℓ⋆, till it arrives at q⋆ (and then r reaches q).
What about the other wedge? It represents the two rays forming ℓ \ s. The vertical line through ℓ⋆ represents the singularity
point at infinity where the two rays are “connected” together. Thus, as r travels along one of the rays (say, starting at q) of ℓ \ s,
the dual line r⋆ becomes steeper and steeper, till it becomes vertical. Now, the point r “jumps” from the “infinite endpoint” of this
ray to the “infinite endpoint” of the other ray. Simultaneously, the line r⋆ continues to rotate from its current vertical position,
sweeping over the whole wedge, till r travels back to p. (The reader who feels uncomfortable with notions like “infinite endpoint”
can rest assured that the author feels the same way. As such, this should be taken as an intuitive description of what is going on and
not a formally correct one.)
Lemma 23.1.1 Let L be a set of lines in the plane. Let α ∈ IR be any number, β− = L_L(α) and β+ = U_L(α). Let p = (α, β−) and
q = (α, β+). Then:
(i) The dual lines p⋆ and q⋆ are parallel, and they are both perpendicular to the direction (α, −1).
(ii) The lines p⋆ and q⋆ support CH(L⋆).
(iii) The extent E_L(α) is the vertical distance between the lines p⋆ and q⋆.
Proof: (i) We have p⋆ ≡ y = αx − β− and q⋆ ≡ y = αx − β+. These two lines are parallel since they have the same slope. In
particular, they are parallel to the direction (1, α). But this direction is perpendicular to the direction (α, −1).
(ii) By property (P2), we have that all the points of L⋆ are below (or on) the line p⋆. Furthermore, since p is on the lower
envelope of L, it follows that p⋆ must pass through one of the points of L⋆. Namely, p⋆ supports CH(L⋆) and it lies above it. A similar
argument applies to q⋆.
(iii) We have that E_L(α) = β+ − β−. The vertical distance between the two parallel lines p⋆ and q⋆ is p⋆(0) − q⋆(0) = −β− − (−β+) = β+ − β−, as required.
Thus, consider a vertex p of the upper envelope of the set of lines L. The point p is the intersection point of two lines ℓ and ℓ′ of L. Consider the dual set of points L⋆ and the dual line p⋆. Since p lies above (or on) all the lines of L, by the above discussion it must be that the line p⋆ lies below (or on) all the points of L⋆. On the other hand, the line p⋆ passes through the two points ℓ⋆ and ℓ′⋆. Namely, p⋆ is a line that supports the convex hull of L⋆ and it passes through two of its vertices.
The convex hull of L⋆ is a convex polygon P, which can be broken into two convex chains by breaking it at the two points extreme in the x direction. We will refer to the upper polygonal chain of the convex hull as the upper convex chain and to the other one as the lower convex chain. In particular, two consecutive segments of the upper envelope correspond to two consecutive vertices on the lower chain of the convex hull of L⋆. Thus, the convex hull of L⋆ can be decomposed into two chains. The lower chain corresponds to the upper envelope of L, and the upper chain corresponds to the lower envelope of L. Of special interest are the two x-extreme points p and q of the convex hull. They are the duals of the two lines with the largest/smallest slopes in L (we are assuming here that the slopes of the lines in L are distinct). These two lines appear on both the upper and lower envelopes of the lines, and they contain the four infinite rays of these envelopes.
[Figure: the convex hull of L⋆, broken at its x-extreme vertices p and q into the upper and lower convex chains.]
Lemma 23.1.2 Given a set L of n lines in the plane, one can compute its lower and upper envelopes in O(n log n) time.
Proof: One can compute the convex hull of n points in the plane in O(n log n) time. Thus, computing the convex hull of L⋆ and dualizing the upper and lower chains of CH(L⋆) results in the required envelopes.
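As an illustration of Lemma 23.1.2, here is a short Python sketch (my own, under the assumption that the input lines are non-vertical and have distinct slopes) that computes both envelopes by dualizing, computing the convex hull with Andrew's monotone chain, and dualizing the two chains back.

```python
def envelopes_via_duality(lines):
    """Compute the lower/upper envelopes of non-vertical lines y = c*x + d,
    given as (c, d) pairs with distinct slopes, via the convex hull of the
    dual points (c, -d).  A hypothetical helper, not code from the text.

    Returns (lower_env, upper_env): the lines on each envelope, by slope."""
    pts = sorted((c, -d) for c, d in lines)           # dual points, sorted by x

    def half_hull(points):                            # one chain of the hull
        chain = []
        for p in points:
            while len(chain) >= 2:
                (x1, y1), (x2, y2) = chain[-2], chain[-1]
                # pop while the triple does not make a left (counter-clockwise) turn
                if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) <= 0:
                    chain.pop()
                else:
                    break
            chain.append(p)
        return chain

    lower_chain = half_hull(pts)                      # lower chain of CH(L*)
    upper_chain = half_hull(pts[::-1])                # upper chain of CH(L*)
    # Per the discussion above: lower chain of CH(L*) <-> upper envelope of L,
    # upper chain of CH(L*) <-> lower envelope of L.
    upper_env = [(c, -y) for c, y in lower_chain]
    lower_env = [(c, -y) for c, y in reversed(upper_chain)]
    return lower_env, upper_env
```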
In the following we will slightly abuse notation, and for a point p ∈ IR^d we will refer to (p_1, ..., p_{d−1}, L_H(p)) as the point L_H(p). Similarly, U_H(p) denotes the corresponding point on the upper envelope of H.
The proof of the following lemma is an easy extension of the 2d case.
23.3 Exercises
Exercise 23.3.1 Prove Lemma 23.2.1
Exercise 23.3.2 Show a counter example proving that no duality can preserve (exactly) orthogonal distances between points and
lines.
23.4 Bibliographical notes
The duality discussed here should not be confused with linear programming duality [Van97]. Although the two topics seem to be
connected somehow, the author is unaware of a natural and easy connection.
A natural question is whether one can find a duality that preserves the orthogonal distances between lines and points. The
surprising answer is no, as Exercise 23.3.2 testifies. In fact, it is not too hard to show, using topological arguments, that any duality
must distort such distances arbitrarily badly [FH06].
Open Problem 23.4.1 Given a set P of n points in the plane, and a set L of n lines in the plane, consider the best possible duality
(i.e., the one that minimizes the distortion of orthogonal distances) for P and L. What is the best distortion possible, as a function
of n?
Here, we define the distortion of the duality as
max_{p∈P, ℓ∈L} max( d(p, ℓ)/d(p⋆, ℓ⋆), d(p⋆, ℓ⋆)/d(p, ℓ) ).
A striking (negative) example of the power of duality is the work of Overmars and van Leeuwen [OvL81] on the dynamic
maintenance of convex hulls in 2d and the maintenance of the lower/upper envelope of lines in the plane. Clearly, by duality, the
two problems are identical. However, the authors (smart people indeed) did not observe this, and the paper is twice as long as it
should be, solving the two problems separately.
Duality is heavily used throughout computational geometry, and it is hard to imagine managing without it. Results and
techniques that use duality include bounds on k-sets/k-levels [Dey98], partition trees [Mat92], and coresets for extent measure
[AHV04] (this is a random short list of relevant results and it is by no means exhaustive).
to the hyperboloid, to the computation of the convex hull of the points. In fact, the projection down of the lower part of the convex
hull is the required triangulation. Thus, the two structures are dual to each other via lifting/linearization as well as by direct duality. The
interested reader should check out [dBvKOS00].
Chapter 24
For example, IR^2 with the regular Euclidean distance is a metric space.
It is usually of interest to consider the finite case, where X is a set of n points. Then, the function d can be specified by \binom{n}{2} real numbers; that is, the distance between every pair of points of X. Alternatively, one can think about (X, d) as a weighted complete graph, where we specify positive weights on the edges, and the resulting weights on the edges comply with the triangle inequality.
In fact, finite metric spaces arise naturally from (sparser) graphs. Indeed, let G = (X, E) be an undirected weighted graph
defined over X, and let d_G(x, y) be the length of the shortest path between x and y in G. It is easy to verify that (X, d_G) is a finite
metric space. As such, if the graph G is sparse, it provides a compact representation of the finite metric space (X, d_G).
Definition 24.1.2 Let (X, d) be an n-point metric space. We denote the open ball of radius r about x ∈ X by b(x, r) = { y ∈ X | d(x, y) < r }.
Underlying our discussion of metric spaces are algorithmic applications. The hardness of various computational problems
depends heavily on the structure of the finite metric space. Thus, given a finite metric space and a computational task, it is natural
to try to map the given metric space into a new metric space where the task at hand becomes easy.
Example 24.1.3 Consider the problem of computing the diameter. While it is not trivial in two dimensions, it is easy in one
dimension. Thus, if we could map points in two dimensions into points in one dimension, such that the diameter is preserved,
then computing the diameter becomes easy. In fact, this approach yields an efficient approximation algorithm, see Exercise 24.7.3
below.
Of course, this mapping from one metric space to another is going to introduce error. We would be interested in minimizing
the error introduced by such a mapping.
Definition 24.1.4 Let (X, d_X) and (Y, d_Y) be metric spaces. A mapping f : X → Y is called an embedding, and is C-Lipschitz if
d_Y(f(x), f(y)) ≤ C · d_X(x, y) for all x, y ∈ X. The mapping f is called K-bi-Lipschitz if there exists a C > 0 such that
C K^{−1} · d_X(x, y) ≤ d_Y(f(x), f(y)) ≤ C · d_X(x, y),
for all x, y ∈ X.
The least K for which f is K-bi-Lipschitz is called the distortion of f, and is denoted dist(f). The least distortion with which
X may be embedded in Y is denoted c_Y(X).
There are several powerful results in this vein that show the existence of embeddings with low distortion. These include:
(A) Probabilistic trees. Every finite metric can be randomly embedded into a tree such that the “expected” distortion for a specific
pair of points is O(log n).
(B) Embedding into Euclidean space. Any n-point metric space can be embedded into (finite dimensional) Euclidean space with
O(log n) distortion.
(C) Dimension reduction. Any n-point set in Euclidean space with the regular Euclidean distance can be embedded into IRk with
distortion (1 + ε), where k = O(ε−2 log n).
24.2 Examples
What is distortion? When considering a mapping f : X → IRd of a metric space (X, d) to IRd , it would useful to observe
that since IRd can be scaled, we can consider f to be an expansive mapping (i.e., no distances shrink). Furthermore, we can in fact
kx−yk
assume that there is at least one pair of points x, y ∈ X, such that d(x, y) = kx − yk. As such, we have dist( f ) = max x,y d(x,y) .
Why is distortion necessary? Consider the graph G = (V, E) with one vertex s connected to three other vertices a, b, c, where the weights on the edges are all one (i.e., G is the star graph with three leaves). We claim that G cannot be embedded into Euclidean space with distortion smaller than 2/√3. Indeed, consider the associated metric space (V, d_G) and an (expansive) embedding f : V → IR^d.
[Figure: the star graph with center s and leaves a, b, c.]
Let △ denote the triangle formed by a′b′c′, where a′ = f(a), b′ = f(b) and c′ = f(c). Next, consider the quantity max(‖a′ − s′‖, ‖b′ − s′‖, ‖c′ − s′‖), which lower bounds the distortion of f. This quantity is minimized when r = ‖a′ − s′‖ = ‖b′ − s′‖ = ‖c′ − s′‖. Namely, s′ is the center of the smallest enclosing circle of △. However, r is minimized when all the edges of △ are of equal length, and are in fact of length d_G(a, b) = 2. It follows that dist(f) ≥ r = 2/√3.
It is known that Ω(log n) distortion is necessary in the worst case when embedding a graph into Euclidean space. This is shown
using expanders [Mat02].
Definition 24.2.1 A hierarchically well-separated tree (HST) is a metric space defined on the leaves of a rooted tree T. To each
vertex u ∈ T there is associated a label ∆_u ≥ 0. This label is zero for all the leaves of T, and it is a positive number for all the
interior nodes. The labels are such that if a vertex u is a child of a vertex v then ∆_u ≤ ∆_v. The distance between two leaves x, y ∈ T
is defined as ∆_{lca(x,y)}, where lca(x, y) is the least common ancestor of x and y in T.
An HST T is a k-HST if for every vertex v ∈ T we have that ∆_v ≤ ∆_{p(v)}/k, where p(v) is the parent of v in T.
Note that an HST is a very limited metric. For example, consider the cycle G = C_n on n vertices, with weight one on each edge,
and consider an expansive embedding f of G into an HST H. It is easy to verify that there must be two consecutive vertices of the
cycle that are mapped to two different subtrees of the root r of H. Since f is expansive, it follows that ∆_r ≥ n/2. As such,
dist(f) ≥ n/2. Namely, HSTs fail to faithfully represent even very simple metrics.
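The HST metric of Definition 24.2.1 is simple to realize in code. The following toy Python sketch (not from the text; the class and function names are mine) stores the labels ∆_u on the nodes and evaluates the distance between two leaves as the label of their least common ancestor.

```python
class HSTNode:
    """A node of a hierarchically well-separated tree.  Interior nodes carry a
    label delta >= 0; leaves have delta = 0."""
    def __init__(self, delta=0.0, children=(), name=None):
        self.delta = delta
        self.children = list(children)
        self.name = name
        self.parent = None
        for c in self.children:
            c.parent = self

def hst_distance(x, y):
    """Distance between two leaves of an HST: the label of their lowest common
    ancestor, found by walking up from both leaves."""
    ancestors_of_x = []
    node = x
    while node is not None:
        ancestors_of_x.append(node)
        node = node.parent
    seen = set(id(a) for a in ancestors_of_x)
    node = y
    while id(node) not in seen:
        node = node.parent
    return node.delta                     # Delta of lca(x, y)

# A 2-HST on four leaves: the root has label 4, its two children have label 2.
a, b, c, d = (HSTNode(name=ch) for ch in "abcd")
left, right = HSTNode(2.0, [a, b]), HSTNode(2.0, [c, d])
root = HSTNode(4.0, [left, right])
assert hst_distance(a, b) == 2.0 and hst_distance(a, c) == 4.0
```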
24.2.2 Clustering
One natural problem we might want to solve on a graph (i.e., finite metric space) (X, d) is to partition it into clusters. One such
natural clustering is k-median clustering, where we would like to choose a set C ⊆ X of k centers such that
ν_C(X, d) = Σ_{q∈X} d(q, C)
is minimized, where d(q, C) = min_{c∈C} d(q, c) is the distance of q to its closest center in C.
It is known that finding the optimal k-median clustering in a (general weighted) graph is NP-complete. As such, the best we
can hope for is an approximation algorithm. However, if the structure of the finite metric space (X, d) is simple, then the problem
can be solved efficiently. For example, if the points of X are on the real line (and the distance between a and b is just |a − b|), then
k-median can be solved using dynamic programming.
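For concreteness, here is a simple Python sketch of the dynamic program alluded to above for k-median on the real line. It is my own illustration (a deliberately naive, roughly cubic-time version), not an algorithm given in the text; it uses the fact that in an optimal solution each median serves a contiguous run of the sorted points.

```python
def k_median_on_line(xs, k):
    """Optimal k-median cost for points xs on the real line, 1 <= k <= len(xs).

    cost[i][j] is the cost of serving xs[i..j] with a single median, which can
    always be taken to be the middle point of the sorted run; best[m][j] is the
    optimal cost of covering the first j points with m medians."""
    xs = sorted(xs)
    n = len(xs)

    cost = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            med = xs[(i + j) // 2]                    # a median of xs[i..j]
            cost[i][j] = sum(abs(x - med) for x in xs[i:j + 1])

    INF = float("inf")
    best = [[INF] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(j):                        # last cluster is xs[i..j-1]
                cand = best[m - 1][i] + cost[i][j - 1]
                if cand < best[m][j]:
                    best[m][j] = cand
    return best[k][n]
```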
Another interesting case is when the metric space (X, d) is an HST. It is not too hard to prove the following lemma; see Exercise 24.7.1.
Lemma 24.2.2 Let (X, d) be a HST defined over n points, and let k > 0 be an integer. One can compute the optimal k-median
clustering of X in O(k2 n) time.
Thus, if we can embed a general graph G into an HST H with low distortion, then we can approximate the k-median clustering
of G by clustering the resulting HST and “importing” the resulting partition to the original space. The quality of approximation
is bounded by the distortion of the embedding of G into H.
24.3 Random Partitions
Let (X, d) be a finite metric space. Given a partition P = {C1 , . . . , Cm } of X, we refer to the sets Ci as clusters. We write PX for the
set of all partitions of X. For x ∈ X and a partition P ∈ PX we denote by P(x) the unique cluster of P containing x. Finally, the set
of all probability distributions on PX is denoted DX .
What we want, and what we can get. The target is to partition the metric space into clusters, such that each cluster
would have diameter at most ∆, for some prespecified parameter ∆.
We would like to have a partition that does not disrupt distances too “badly”. Intuitively, that means that a pair of points at distance larger than ∆ will be separated by the clustering, but points that are closer to each other will be in the same cluster.
This is of course impossible, as any clustering must separate points that are close to each other. To see that, consider a set of points
densely packed on the interval [0, 10], and let ∆ < 5. Clearly, there would always be two close points that would be in two separate
clusters.
As such, our strategy would be to use partitions that are constructed randomly, and the best we can hope for, is that the
probability of them being separated is a function of their distance t, which would be small if t is small. As an example, for the
case of points on the real line, take the natural partition into intervals of length ∆ (that is, all the points in the interval [i∆, (i + 1)∆)
would belong to the same cluster), and randomly shift it by a random number x picked uniformly in [0, ∆). Namely, all the points
belonging to [x + i∆, x + (i + 1)∆) would belong to the same cluster. Now, it is easy to verify that for any two points p, q ∈ IR, of
distance t = |p − q| from each other, the probability that they are in two different intervals is bounded by t/∆ (see Exercise 24.7.4).
And intuitively, this is the best one can hope for.
As such, the clustering scheme we seek should separate two points at distance t from each other with probability (t/∆) · noise,
where noise is hopefully small.
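The randomly shifted partition of the line described above can be simulated directly. The following Python toy (mine, not from the text) estimates the separation probability empirically; for t = |p − q| ≤ ∆ it works out to exactly t/∆ (Exercise 24.7.4(A) asks for the upper bound).

```python
import math
import random

def cluster_index(p, delta, shift):
    """Index of the interval [shift + i*delta, shift + (i+1)*delta) containing p."""
    return math.floor((p - shift) / delta)

def separation_probability(p, q, delta, trials=100_000):
    """Empirically estimate the probability that the randomly shifted partition
    of the real line separates p and q."""
    sep = 0
    for _ in range(trials):
        shift = random.uniform(0.0, delta)
        if cluster_index(p, delta, shift) != cluster_index(q, delta, shift):
            sep += 1
    return sep / trials

# With |p - q| = 0.3 and delta = 1.0 the estimate should be close to 0.3.
print(separation_probability(0.0, 0.3, 1.0))
```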
24.3.2 Properties
Lemma 24.3.1 Let (X, d) be a finite metric space, ∆ = 2^u a prescribed parameter, and let P be the partition of X generated by the
above random partition. Then the following holds:
(i) For any C ∈ P, we have diam(C) ≤ ∆.
(ii) Let x be any point of X, and t a parameter ≤ ∆/8. For B = b(x, t), we have that
Pr[B ⊄ P(x)] ≤ (8t/∆) ln(M/m),
where M = |b(x, ∆)| and m = |b(x, ∆/8)|.
Proof: Since Cy ⊆ b(y, R), we have that diam(Cy ) ≤ ∆, and thus the first claim holds.
Let U be the set of points w ∈ b(x, ∆) such that b(w, R) ∩ B ≠ ∅. Arrange the points of U in increasing distance from x, and
let w_1, ..., w_{M′} denote the resulting order, where M′ = |U|. For k = 1, ..., M′, let I_k = [d(x, w_k) − t, d(x, w_k) + t] and write E_k for
the event that w_k is the first point in π such that B ∩ C_{w_k} ≠ ∅, and yet B ⊄ C_{w_k}. Observe that if B ⊄ P(x) then one of the events
E_1, ..., E_{M′} must occur.
Note that if wk ∈ b(x, ∆/8), then Pr[Ek ] = 0 since t ≤ ∆/8 and B = b(x, t) ⊆ b(x, ∆/8) ⊆ b(wk , ∆/4) ⊆ b(wk , R). Indeed, when
we “scoop” out the cluster Cwk , either B would be fully contained inside Cwk , or alternatively, if B is not fully contained inside Cwk ,
then some parts of B were already “scooped out” by some other point of U, and as such Ek does not happen.
In particular, w_1, ..., w_m are inside b(x, ∆/8), and as such Pr[E_1] = · · · = Pr[E_m] = 0. Also, note that if d(x, w_k) < R − t then
b(w_k, R) contains B and as such E_k cannot happen. Similarly, if d(x, w_k) > R + t then b(w_k, R) ∩ B = ∅ and E_k cannot happen. As
such, if E_k happens then R − t ≤ d(x, w_k) ≤ R + t. Namely, if E_k happens then R ∈ I_k. We conclude that
Pr[Ek ] = Pr[Ek ∩ (R ∈ Ik )] = Pr[R ∈ Ik ] · Pr[Ek | R ∈ Ik ] .
Now, R is uniformly distributed in the interval [∆/4, ∆/2], and Ik is an interval of length 2t. Thus, Pr[R ∈ Ik ] ≤ 2t/(∆/4) = 8t/∆.
Next, to bound Pr[Ek | R ∈ Ik ], we observe that w1 , . . . , wk−1 are closer to x than wk and their distance to b(x, t) is smaller than
R. Thus, if any of them appear before wk in π then Ek does not happen. Thus, Pr[Ek | R ∈ Ik ] is bounded by the probability that wk
is the first to appear in π out of w1 , . . . , wk . But this probability is 1/k, and thus Pr[Ek | R ∈ Ik ] ≤ 1/k.
We are now ready for the kill. Indeed,
Pr[B ⊄ P(x)] ≤ Σ_{k=1}^{M′} Pr[E_k] = Σ_{k=m+1}^{M′} Pr[E_k] = Σ_{k=m+1}^{M′} Pr[R ∈ I_k] · Pr[E_k | R ∈ I_k]
≤ Σ_{k=m+1}^{M′} (8t/∆) · (1/k) ≤ (8t/∆) ln(M′/m) ≤ (8t/∆) ln(M/m),
since Σ_{k=m+1}^{M′} 1/k ≤ ∫_m^{M′} dx/x = ln(M′/m) and M′ ≤ M.
Theorem 24.4.1 Given an n-point metric space (X, d), one can randomly embed it into a 2-HST with probabilistic distortion ≤ 24 ln n.
Proof: The construction is recursive. Let ∆ = diam(X), and compute a random partition of X with cluster diameter ∆/2,
using the construction of Section 24.3.1. We recursively construct a 2-HST for each cluster, and hang the resulting trees on the
root node v, which is labeled by ∆_v = ∆. Clearly, the resulting tree is a 2-HST.
For a node v ∈ T , let X(v) be the set of points of X contained in the subtree of v.
For the analysis, assume diam(X) = 1, and consider two points x, y ∈ X. We consider a node v ∈ T to be at level i if
level(v) = ⌈lg ∆_v⌉ = i. The two points x and y correspond to two leaves in T; let û be the least common ancestor of x and y in T.
We have d_T(x, y) ≤ 2^{level(û)}. Furthermore, note that along a path from a leaf to the root the levels are strictly monotonically increasing.
In fact, we will be conservative, and let w be the first ancestor of x such that b = b(x, d(x, y)) is not completely
contained in X(u_1), ..., X(u_m), where u_1, ..., u_m are the children of w. Clearly, level(w) ≥ level(û). Thus, d_T(x, y) ≤ 2^{level(w)}.
Consider the path σ from the root of T to x, and let E_i be the event that b is not fully contained in X(v_i), where v_i is the node
of σ at level i (if such a node exists). Furthermore, let Y_i be the indicator variable that is 1 if E_i is the first to happen out of the
sequence of events E_0, E_{−1}, .... Clearly, d_T(x, y) ≤ Σ_i Y_i 2^i.
Let t = d(x, y), j = ⌊lg d(x, y)⌋, and n_i = |b(x, 2^i)| for i = 0, −1, −2, .... We have
E[d_T(x, y)] ≤ Σ_{i=j}^{0} E[Y_i] 2^i ≤ Σ_{i=j}^{0} 2^i Pr[E_i] ≤ Σ_{i=j}^{0} 2^i · (8t/2^i) ln(n_i/n_{i−3}) = 8t Σ_{i=j}^{0} ln(n_i/n_{i−3}) ≤ 8t · 3 ln n = 24 d(x, y) ln n,
since the last sum telescopes (separately over the indices i, i − 3, i − 6, ...) to at most 3 ln n_0 ≤ 3 ln n. This is exactly the claimed bound on the probabilistic distortion.
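A direct rendering of the recursive construction in this proof is given below; it is a sketch of mine, not code from the text. It expects a partition routine with the interface of the random-partition sketch above (one that returns clusters of diameter at most the requested ∆) and assumes the input points are distinct.

```python
def build_2hst(points, dist, partition):
    """Recursive 2-HST construction: partition with cluster diameter diam/2,
    recurse on every cluster, and hang the subtrees under a root labeled with
    the diameter.  Returns a nested tuple (delta_v, children); a leaf is
    represented as (0.0, point).

    partition(points, dist, delta) must return a dict point -> cluster label,
    with every cluster of diameter at most delta (e.g. random_partition above)."""
    if len(points) == 1:
        return (0.0, points[0])
    diam = max(dist(p, q) for p in points for q in points)
    cluster_of = partition(points, dist, diam / 2.0)
    clusters = {}
    for p in points:
        clusters.setdefault(cluster_of[p], []).append(p)
    children = [build_2hst(c, dist, partition) for c in clusters.values()]
    return (diam, children)
```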
Theorem 24.4.2 Let (X, d) be an n-point metric space. One can compute in polynomial time a k-median clustering of X which has
expected price O(α log n), where α is the price of the optimal k-median clustering of (X, d).
Proof: The algorithm is described above, and the fact that its running time is polynomial can easily be verified. To prove
the bound on the quality of the clustering, for any point p ∈ X, let c(p) denote the closest point of C_opt to p according to d, where
C_opt is the set of k medians in the optimal clustering. Let C be the set of k medians returned by the algorithm, and let H be the HST
used by the algorithm. We have
β = ν_C(X, d) ≤ ν_C(X, d_H) ≤ ν_{C_opt}(X, d_H) = Σ_{p∈X} d_H(p, C_opt) ≤ Σ_{p∈X} d_H(p, c(p)).
Proof: Indeed, let x′ and y′ be the closest points of Y to x and y, respectively. Observe that f(x) = d(x, x′) ≤ d(x, y′) ≤
d(x, y) + d(y, y′) = d(x, y) + f(y), by the triangle inequality. Thus, f(x) − f(y) ≤ d(x, y). By symmetry, we have f(y) − f(x) ≤ d(x, y).
Thus, |f(x) − f(y)| ≤ d(x, y).
Theorem 24.5.2 Given an n-point metric space Y = (X, d) with spread Φ, one can embed it into Euclidean space IR^k with distortion
O(√(ln Φ) · ln n), where k = O(ln Φ ln n).
Proof: Assume that diam(Y) = Φ (i.e., the smallest distance in Y is 1), and let r_i = 2^{i−2}, for i = 1, ..., α, where α = ⌈lg Φ⌉. Let
P_{i,j} be a random partition of X with diameter r_i, using Theorem 24.4.1, for i = 1, ..., α and j = 1, ..., β, where β = c log n and c
is a large enough constant, to be determined shortly.
For each cluster of P_{i,j} randomly toss a coin, and let V_{i,j} be all the points of X that belong to clusters in P_{i,j} that got 'T' in
their coin toss. For a point x ∈ X, let f_{i,j}(x) = d(x, X \ V_{i,j}) = min_{v∈X\V_{i,j}} d(x, v), for i = 0, ..., m and j = 1, ..., β. Let F : X →
IR^{(m+1)·β} be the embedding such that F(x) = ( f_{0,1}(x), f_{0,2}(x), ..., f_{0,β}(x), f_{1,1}(x), f_{1,2}(x), ..., f_{1,β}(x), ..., f_{m,1}(x), f_{m,2}(x), ..., f_{m,β}(x) ).
Next, consider two points x, y ∈ X with distance φ = d(x, y). Let u be an integer such that r_u ≤ φ/2 ≤ r_{u+1}. Clearly, in any of the
partitions P_{u,1}, ..., P_{u,β} the points x and y belong to different clusters. Furthermore, with probability half, x ∈ V_{u,j} and y ∉ V_{u,j}, or
x ∉ V_{u,j} and y ∈ V_{u,j}, for each 1 ≤ j ≤ β.
Let E_j denote the event that b(x, ρ) ⊆ V_{u,j} and y ∉ V_{u,j}, for j = 1, ..., β, where ρ = φ/(64 ln n). By Lemma 24.3.1, we have
Pr[b(x, ρ) ⊄ P_{u,j}(x)] ≤ (8ρ/r_u) ln n ≤ φ/(8 r_u) ≤ 1/2.
Thus,
Pr[E_j] ≥ Pr[ (b(x, ρ) ⊆ P_{u,j}(x)) ∩ (x ∈ V_{u,j}) ∩ (y ∉ V_{u,j}) ]
= Pr[b(x, ρ) ⊆ P_{u,j}(x)] · Pr[x ∈ V_{u,j}] · Pr[y ∉ V_{u,j}] ≥ 1/8,
since those three events are independent. Notice that if E_j happens, then f_{u,j}(x) ≥ ρ and f_{u,j}(y) = 0.
Let X_j be an indicator variable that is 1 if E_j happens, for j = 1, ..., β. Let Z = Σ_j X_j, and we have µ = E[Z] = E[Σ_j X_j] ≥
β/8. Thus, the probability that only β/16 of the events E_1, ..., E_β happen is at most Pr[Z < (1 − 1/2) E[Z]]. By the Chernoff inequality,
Pr[Z < (1 − 1/2) E[Z]] ≤ exp(−µ(1/2)²/2) = exp(−µ/8) ≤ exp(−β/64) ≤ 1/n^{10}, if we set c = 640.
Thus, with high probability,
‖F(x) − F(y)‖ ≥ √( Σ_{j=1}^{β} ( f_{u,j}(x) − f_{u,j}(y) )² ) ≥ √( (β/16) ρ² ) = ρ√β/4 = φ · √β/(256 ln n).
On the other hand, |f_{i,j}(x) − f_{i,j}(y)| ≤ d(x, y) = φ ≤ 64ρ ln n. Thus,
‖F(x) − F(y)‖ ≤ √( αβ (64ρ ln n)² ) = 64 √(αβ) ρ ln n = √(αβ) · φ.
Thus, setting G(x) = F(x) · (256 ln n)/√β, we get a mapping that maps two points at distance φ from each other to two points with
distance in the range [φ, φ · √α · 256 ln n]. Namely, G(·) is an embedding with distortion O(√α ln n) = O(√(ln Φ) · ln n).
The probability that G fails on one of the pairs is smaller than (1/n^{10}) · n² < 1/n^8. In particular, we can check the distortion
of G for all n² pairs, and if any of them fail (i.e., the distortion is too large), we restart the process.
Thus, ‖F(x) − F(y)‖ ≤ φ √(5β lg n). We conclude that, with high probability, F(·) is an embedding of X into Euclidean space with
distortion ( φ √(5β lg n) ) / ( φ √β/(256 ln n) ) = O(log^{3/2} n).
Indeed, if f_{i,j}(x) < d_i(x, V_{i,j}) and f_{i,j}(y) < d_i(y, V_{i,j}), then f_{i,j}(x) = 2r_i and f_{i,j}(y) = 2r_i, which implies the above inequality. If
f_{i,j}(x) = d_i(x, V_{i,j}) and f_{i,j}(y) = d_i(y, V_{i,j}), then the inequality trivially holds. The other option is handled in a similar fashion.
We still have to handle the problem of the infinite number of coordinates. However, the above proof shows that we care about a
resolution r_i (i.e., it contributes to the estimates in the above proof) only if there is a pair x and y such that r_i/n² ≤ d(x, y) ≤ r_i n².
Thus, for every pair of distances there are only O(log n) relevant resolutions. Thus, there are at most η = O(n² β log n) = O(n² log² n)
relevant coordinates, and we can ignore all the other coordinates. Next, consider the affine subspace h that spans F(P). Clearly, it
is (n − 1)-dimensional. Consider the projection G : IR^η → IR^{n−1} that maps a point to its closest point in h. Clearly, G(F(·)) is an
embedding with the same distortion for P, and the target space is of dimension n − 1.
Note that this whole process succeeds with high probability. If it fails, we try again. We conclude:
Theorem 24.5.3 (Low quality Bourgain theorem.) Given an n-point metric space M, one can embed it into Euclidean space of dimension n − 1, such that the distortion of the embedding is at most O(log^{3/2} n).
Using the Johnson-Lindenstrauss lemma, the dimension can be further reduced to O(log n). In fact, being more careful in the
proof, it is possible to reduce the dimension to O(log n) directly.
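For intuition, the embedding used in Theorems 24.5.2 and 24.5.3 can be sketched as follows. This is my own toy rendering, with deliberately small parameters; the proof takes β = c log n with c = 640, and it handles the degenerate coordinates (when no cluster is discarded) more carefully than the crude cap used here. The `partition` argument is expected to behave like the random-partition sketch given earlier.

```python
import math
import random

def bourgain_style_embedding(points, dist, partition, beta=16):
    """For each resolution r_i = 2^{i-2} (in units of the minimum interpoint
    distance) and each of beta repetitions, build a random partition with
    diameter r_i, keep every cluster independently with probability 1/2 (the
    set V_ij), and record the coordinate f_ij(x) = d(x, X \\ V_ij).
    Returns a dict mapping each point to its list of coordinates."""
    pairs = [(p, q) for p in points for q in points if p != q]
    min_d = min(dist(p, q) for p, q in pairs)
    diam = max(dist(p, q) for p, q in pairs)
    alpha = max(1, math.ceil(math.log2(diam / min_d)))     # lg of the spread
    coords = {p: [] for p in points}
    for i in range(1, alpha + 1):
        r_i = min_d * 2.0 ** (i - 2)
        for _ in range(beta):
            cluster_of = partition(points, dist, r_i)
            kept = {c for c in set(cluster_of.values()) if random.random() < 0.5}
            V = {p for p in points if cluster_of[p] in kept}
            outside = [p for p in points if p not in V]
            for x in points:
                # d(x, X \ V); if V = X we simply cap the coordinate at the diameter.
                coords[x].append(min((dist(x, q) for q in outside), default=diam))
    return coords
```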
24.7 Exercises
Exercise 24.7.1 (Clustering for HST.) [4 Points]
Let (X, d) be a HST defined over n points, and let k > 0 be an integer. Provide an algorithm that computes the optimal k-median
clustering of X in O(k2 n) time.
[Hint: Transform the HST into a tree where every node has only two children. Next, run a dynamic programming algorithm
on this tree.]
(A) [2 Points] Give a counter example to the following claim: Let (X, d) be a metric space, and let P be a partition of X. Then,
the pair (P, d0 ) is a metric, where d0 (C, C 0 ) = d(C, C 0 ) = min x∈C,y∈C 0 d(x, y) and C, C 0 ∈ P.
(B) [8 Points] Let (X, d) be an n-point metric space, and consider the set U = { i | 2^i ≤ d(x, y) ≤ 2^{i+1} for some x, y ∈ X }. Prove that
|U| = O(n). Namely, there are only n different resolutions that “matter” for a finite metric space.
(A) [1 Points] Let ` be a line in the plane, and consider the embedding f : IR2 → `, which is the projection of the plane into `.
Prove that f is 1-Lipschitz, but it is not K-bi-Lipschitz for any constant K.
(B) [3 Points] Prove that one can find a family of projections F of size O(1/√ε), such that for any two points x, y ∈ IR², for one of
the projections f ∈ F we have d(f(x), f(y)) ≥ (1 − ε)d(x, y).
(C) [1 Points] Given a set P of n points in the plane, give an O(n/√ε) time algorithm that outputs two points x, y ∈ P, such that
d(x, y) ≥ (1 − ε)diam(P), where diam(P) = max_{z,w∈P} d(z, w) is the diameter of P.
(D) [2 Points] Given P, show how to extract, in O(n) time, a set Q ⊆ P of size O(ε−2 ), such that diam(Q) ≥ (1 − ε/2)diam(P).
(Hint: Construct a grid of appropriate resolution.)
In particular, give an (1 − ε)-approximation algorithm to the diameter of P that works in O(n + ε−2.5 ) time. (There are slightly
faster approximation algorithms known for approximating the diameter.)
(A) [1 Points] For a real number ∆ > 0 and a random number x ∈ [0, ∆], consider the random partition of the real line into
intervals of length ∆, such that all the points falling into the interval [x + i∆, x + (i + 1)∆) are in the same cluster. Prove that for two
points p, q ∈ IR, the probability that p and q are in two different clusters is at most |p − q|/∆.
(B) [3 Points] Consider the d-dimensional grid of side length ∆, and let p be a random vector in the hypercube [0, ∆]^d. Shift the
grid by p, and consider the partition of IR^d induced by this grid. Formally, the space is partitioned into clusters, where for each
j ∈ Z^d the points inside the cube p + ∆j + [0, ∆)^d form one cluster. Consider any two points q, r ∈ IR^d. Prove that the probability that q and r are
in different clusters is bounded by d ‖q − r‖ /∆.
(C) [6 Points] Strengthen (B) by showing that the probability is bounded by √d ‖q − r‖ /∆. [Hint: Consider the distance t =
‖q − r‖ to be fixed, and figure out what the worst case for this partition is.]
Part (C) implies that we can partition space into clusters of diameter ∆′ = √d · ∆ such that the probability of points at distance
t from each other being separated is bounded by d t/∆′.
Acknowledgments
The presentation in this write-up follows closely the insightful suggestions of Manor Mendel.
Chapter 25
Tail Inequalities
"Wir müssen wissen, wir werden wissen" (We must know, we shall know)
— David Hilbert
E[X] ≥ 0 + t_0 · Pr[X ≥ t_0] > 0 + t_0 · (E[X]/t_0) = E[X],
a contradiction.
Theorem 25.1.2 (Chebychev inequality) Let X be a random variable with µ_X = E[X] and let σ_X be the standard deviation of X; that
is, σ_X² = E[(X − µ_X)²]. Then, Pr[|X − µ_X| ≥ tσ_X] ≤ 1/t².
Proof: Note that
Pr[|X − µ_X| ≥ tσ_X] = Pr[(X − µ_X)² ≥ t² σ_X²].
Set Y = (X − µ_X)². Clearly, E[Y] = σ_X². Now, apply the Markov inequality to Y.
Proof: Clearly, for an arbitrary t > 0, to be specified shortly, we have
Pr[Y ≥ ∆] = Pr[exp(tY) ≥ exp(t∆)] ≤ E[exp(tY)]/exp(t∆),
where the first equality follows from the fact that exp(·) preserves ordering, and the inequality follows from the Markov inequality.
Observe that
E[exp(tX_i)] = (1/2)e^t + (1/2)e^{−t} = (e^t + e^{−t})/2
= (1/2)(1 + t/1! + t²/2! + t³/3! + ⋯) + (1/2)(1 − t/1! + t²/2! − t³/3! + ⋯)
= 1 + t²/2! + ⋯ + t^{2k}/(2k)! + ⋯,
by the Taylor expansion of exp(·). Note that (2k)! ≥ (k!) 2^k, and thus
E[exp(tX_i)] = Σ_{i=0}^{∞} t^{2i}/(2i)! ≤ Σ_{i=0}^{∞} t^{2i}/(2^i (i!)) = Σ_{i=0}^{∞} (1/i!)(t²/2)^i = exp(t²/2),
again by the Taylor expansion of exp(·). Next, by the independence of the X_i's, we have
E[exp(tY)] = E[exp(Σ_i tX_i)] = E[Π_i exp(tX_i)] = Π_{i=1}^{n} E[exp(tX_i)] ≤ Π_{i=1}^{n} e^{t²/2} = e^{nt²/2}.
We have
Pr[Y ≥ ∆] ≤ exp(nt²/2)/exp(t∆) = exp(nt²/2 − t∆).
Next, minimizing the above quantity as a function of t, we set t = ∆/n. We conclude that
Pr[Y ≥ ∆] ≤ exp( (n/2)(∆/n)² − (∆/n)∆ ) = exp(−∆²/(2n)).
Corollary 25.2.2 Let X_1, ..., X_n be n independent random variables, such that Pr[X_i = 1] = Pr[X_i = −1] = 1/2, for i = 1, ..., n. Let
Y = Σ_{i=1}^{n} X_i. Then, for any ∆ > 0, we have
Pr[|Y| ≥ ∆] ≤ 2e^{−∆²/2n}.
Corollary 25.2.3 Let X_1, ..., X_n be n independent coin flips, such that Pr[X_i = 0] = Pr[X_i = 1] = 1/2, for i = 1, ..., n. Let
Y = Σ_{i=1}^{n} X_i. Then, for any ∆ > 0, we have
Pr[|Y − n/2| ≥ ∆] ≤ 2e^{−2∆²/n}.
Remark 25.2.4 Before going any further, it might be instructive to understand what these inequalities imply. Consider the
case where X_i is either zero or one with probability half. In this case µ = E[Y] = n/2. Set ∆ = t√n (note that √µ is approximately the
standard deviation of Y if p_i = 1/2). We have
Pr[|Y − n/2| ≥ ∆] ≤ 2 exp(−2∆²/n) = 2 exp(−2(t√n)²/n) = 2 exp(−2t²).
Thus, the Chernoff inequality implies exponential decay (i.e., ≤ 2^{−t}) with t standard deviations, instead of just the polynomial decay
implied by Chebychev's inequality.
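A quick experiment (mine, not from the text) makes this remark tangible: it compares the empirical tail of a sum of fair coin flips with the bound of Corollary 25.2.3.

```python
import math
import random

def empirical_tail(n, delta, trials=20_000):
    """Estimate Pr[|Y - n/2| >= delta] for Y a sum of n fair coin flips, and
    return it together with the bound 2*exp(-2*delta^2/n) of Corollary 25.2.3."""
    hits = 0
    for _ in range(trials):
        y = sum(random.randint(0, 1) for _ in range(n))
        if abs(y - n / 2.0) >= delta:
            hits += 1
    return hits / trials, 2.0 * math.exp(-2.0 * delta ** 2 / n)

# With n = 100 and a deviation of two standard deviations (delta = 10), the
# empirical tail is a few percent, while the bound evaluates to 2e^{-2} ~ 0.27.
print(empirical_tail(100, 10))
```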
25.2.2 The Chernoff Bound — General Case
Here we present the Chernoff bound in a more general setting.
Example 25.2.8 The Arkansas Aardvarks win each game with probability 1/3. What is the probability that they have a winning season in a season with n games? By the Chernoff inequality, this probability is smaller than
F⁺(n/3, 1/2) = [ e^{1/2} / 1.5^{1.5} ]^{n/3} = (0.89745)^{n/3} = 0.964577^n.
For n = 40, this probability is smaller than 0.236307. For n = 100, it is less than 0.027145. For n = 1000, it is smaller than
2.17221 · 10^{−16} (which is pretty slim and shady). Namely, as the number of games increases, the distribution concentrates around
its expectation, and this convergence is exponential.
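The numbers quoted in the example are easy to reproduce; the short Python check below (not from the text) evaluates the bound F⁺(n/3, 1/2) = (e^{1/2}/1.5^{1.5})^{n/3} for the three values of n.

```python
import math

def winning_season_bound(n):
    """The Chernoff bound F+(n/3, 1/2) from Example 25.2.8: mu = n/3 expected
    wins and delta = 1/2, since a winning season means more than (1+1/2)*mu wins."""
    base = math.exp(0.5) / 1.5 ** 1.5          # ~ 0.89745
    return base ** (n / 3.0)

for n in (40, 100, 1000):
    print(n, winning_season_bound(n))
# Prints roughly 0.236, 0.0271 and 2.17e-16, matching the values quoted above.
```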
Values   Probabilities                                Inequality                                    Ref
−1, +1   Pr[X_i = −1] = Pr[X_i = 1] = 1/2             Pr[Y ≥ ∆] ≤ e^{−∆²/2n}                        Theorem 25.2.1
                                                      Pr[Y ≤ −∆] ≤ e^{−∆²/2n}                       Theorem 25.2.1
                                                      Pr[|Y| ≥ ∆] ≤ 2e^{−∆²/2n}                     Corollary 25.2.2
0, 1     Pr[X_i = 0] = Pr[X_i = 1] = 1/2              Pr[|Y − n/2| ≥ ∆] ≤ 2e^{−2∆²/n}               Corollary 25.2.3
0, 1     Pr[X_i = 0] = 1 − p_i, Pr[X_i = 1] = p_i     Pr[Y > (1 + δ)µ] < ( e^δ/(1 + δ)^{1+δ} )^µ    Theorem 25.2.6
         For δ ≤ 2e − 1                               Pr[Y > (1 + δ)µ] < exp(−µδ²/4)                Theorem 25.2.6
         For δ ≥ 2e − 1                               Pr[Y > (1 + δ)µ] < 2^{−µ(1+δ)}
         For δ ≥ 0                                    Pr[Y < (1 − δ)µ] < exp(−µδ²/2)                Theorem 25.2.9

Table 25.1: Summary of the Chernoff-type inequalities covered. Here we have n variables X_1, ..., X_n, Y = Σ_i X_i
and µ = E[Y].
Definition 25.2.10 F⁻(µ, δ) = e^{−µδ²/2}.
Let ∆⁻(µ, ε) denote the value of δ for which this probability becomes smaller than ε; that is,
∆⁻(µ, ε) = √( 2 log(1/ε) / µ ).
For large δ:
∆⁺(µ, ε) < log₂(1/ε)/µ − 1,
for δ ≤ 1/2.
25.4 Exercises
Exercise 25.4.1 (Simpler Tail Inequality.) [1 Points]
[2 Points] Prove that for δ > 2e − 1, we have
F⁺(µ, δ) < ( e/(1 + δ) )^{(1+δ)µ} ≤ 2^{−(1+δ)µ}.
Bibliography
[AACS98] P. K. Agarwal, B. Aronov, T. M. Chan, and M. Sharir. On levels in arrangements of lines, segments, planes, and
triangles. Discrete Comput. Geom., 19:315–331, 1998.
[AB99] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge, 1999.
[AC07] P. Afshani and T. M. Chan. On approximate range counting and depth. In Proc. 23rd Annu. ACM Sympos. Comput.
Geom., pages 337–343, 2007.
[Ach01] D. Achlioptas. Database-friendly random projections. In Proc. 20th ACM Sympos. Principles Database Syst., pages
274–281, 2001.
[ACNS82] M. Ajtai, V. Chvátal, M. Newborn, and E. Szemerédi. Crossing-free subgraphs. Ann. Discrete Math., 12:9–12, 1982.
[AD97] P. K. Agarwal and P. K. Desikan. An efficient algorithm for terrain simplification. In Proc. 8th ACM-SIAM Sympos.
Discrete Algorithms, pages 139–147, 1997.
[AEIS99] A. Amir, A. Efrat, P. Indyk, and H. Samet. Efficient algorithms and regular data structures for dilation, location and
proximity problems. In Proc. 40th Annu. IEEE Sympos. Found. Comput. Sci., pages 160–170, 1999.
[AG86] N. Alon and E. Győri. The number of small semispaces of a finite set of points in the plane. J. Combin. Theory Ser.
A, 41:154–157, 1986.
[AGK+ 01] V. Arya, N. Garg, R. Khandekar, K. Munagala, and V. Pandit. Local search heuristic for k-median and facility
location problems. In Proc. 33rd Annu. ACM Sympos. Theory Comput., pages 21–29, 2001.
[AH05] B. Aronov and S. Har-Peled. On approximating the depth and related problems. In Proc. 16th ACM-SIAM Sympos.
Discrete Algorithms, pages 886–894, 2005.
[AHK06] S. Arora, E. Hazan, and S. Kale. Multiplicative weights method: a meta-algorithm and its applications. manuscript.
Available from , 2006.
[AHS07] B. Aronov, S. Har-Peled, and M. Sharir. On approximate halfspace range counting and relative epsilon-
approximations. In Proc. 23rd Annu. ACM Sympos. Comput. Geom., pages 327–336, 2007.
[AHV04] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Approximating extent measures of points. J. Assoc. Comput.
Mach., 51(4):606–635, 2004.
[AHY07] P. Agarwal, S. Har-Peled, and H. Yu. Embeddings of surfaces, curves, and moving points in euclidean space. In
Proc. 23rd Annu. ACM Sympos. Comput. Geom., pages 381–389, 2007.
[AKPW95] N. Alon, R. M. Karp, D. Peleg, and D. West. A graph-theoretic game and its application to the k-server problem.
SIAM J. Comput., 24(1):78–100, February 1995.
[AM94] P. K. Agarwal and J. Matoušek. On range searching with semialgebraic sets. Discrete Comput. Geom., 11:393–418,
1994.
[AM98] S. Arya and D. Mount. ANN: library for approximate nearest neighbor searching. http://www.cs.umd.edu/
~mount/ANN/, 1998.
[AM02] S. Arya and T. Malamatos. Linear-size approximate Voronoi diagrams. In Proc. 13th ACM-SIAM Sympos. Discrete
Algorithms, pages 147–155, 2002.
[AM04] S. Arya and D. M. Mount. Computational geometry: Proximity and location. In D. Mehta and S. Sahni, editors,
Handbook of Data Structures and Applications, chapter 63. CRC Press LLC, Boca Raton, FL, 2004. to appear.
[Ame94] N. Amenta. Helly-type theorems and generalized linear programming. Discrete Comput. Geom., 12:241–261, 1994.
[AMM02] S. Arya, T. Malamatos, and D. M. Mount. Space-efficient approximate Voronoi diagrams. In Proc. 34th Annu. ACM
Sympos. Theory Comput., pages 721–730, 2002.
[AMN+ 98] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest
neighbor searching in fixed dimensions. J. Assoc. Comput. Mach., 45(6), 1998.
[AMS94] P. K. Agarwal, J. Matoušek, and O. Schwarzkopf. Computing many faces in arrangements of lines and segments. In
Proc. 10th Annu. ACM Sympos. Comput. Geom., pages 76–84, 1994.
[APV02] P. K. Agarwal, C. M. Procopiuc, and K. R. Varadarajan. Approximation algorithms for k-line center. In Proc. 10th
Annu. European Sympos. Algorithms, pages 54–63, 2002.
[Aro98] S. Arora. Polynomial time approximation schemes for euclidean TSP and other geometric problems. J. Assoc.
Comput. Mach., 45(5):753–782, Sep 1998.
[AS00] N. Alon and J. H. Spencer. The probabilistic method. Wiley Inter-Science, 2nd edition, 2000.
[Aur91] F. Aurenhammer. Voronoi diagrams: A survey of a fundamental geometric data structure. ACM Comput. Surv.,
23:345–405, 1991.
[Bal97] K. Ball. An elementary introduction to modern convex geometry. In Flavors of geometry, volume MSRI Publ. 31.
Cambridge Univ. Press, 1997. http://www.msri.org/publications/books/Book31/files/ball.pdf.
[Bar96] Y. Bartal. Probabilistic approximations of metric space and its algorithmic application. In Proc. 37th Annu. IEEE
Sympos. Found. Comput. Sci., pages 183–193, October 1996.
[Bar98] Y. Bartal. On approximating arbitrary metrices by tree metrics. In Proc. 30th Annu. ACM Sympos. Theory Comput.,
pages 161–168, 1998.
[Bar02] A. Barvinok. A course in convexity, volume 54 of Graduate Studies in Mathematics. American Mathematical
Society, Providence, RI, 2002.
[BEG94] M. Bern, D. Eppstein, and J. Gilbert. Provably good mesh generation. J. Comput. Syst. Sci., 48:384–409, 1994.
[BH01] G. Barequet and S. Har-Peled. Efficiently approximating the minimum-volume bounding box of a point set in three
dimensions. J. Algorithms, 38:91–109, 2001.
[BHI02] M. Bădoiu, S. Har-Peled, and P. Indyk. Approximate clustering via coresets. In Proc. 34th Annu. ACM Sympos.
Theory Comput., pages 250–257, 2002.
[BM58] G. E.P. Box and M. E. Muller. A note on the generation of random normal deviates. Annl. Math. Stat., 28:610–611,
1958.
[BMP05] P. Brass, W. Moser, and J. Pach. Research Problems in Discrete Geometry. Springer, 2005.
[BVZ01] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern
Anal. Mach. Intell., 23(11):1222–1239, 2001.
[BY98] J.-D. Boissonnat and M. Yvinec. Algorithmic Geometry. Cambridge University Press, UK, 1998. Translated by H.
Brönnimann.
[Cal95] P. B. Callahan. Dealing with higher dimensions: the well-separated pair decomposition and its applications. Ph.D.
thesis, Dept. Comput. Sci., Johns Hopkins University, Baltimore, Maryland, 1995.
[Car76] L. Carroll. The hunting of the snark, 1876.
[CF90] B. Chazelle and J. Friedman. A deterministic view of random sampling and its use in geometry. Combinatorica,
10(3):229–249, 1990.
[Cha96] T. M. Chan. Fixed-dimensional linear programming queries made easy. In Proc. 12th Annu. ACM Sympos. Comput.
Geom., pages 284–290, 1996.
[Cha98] T. M. Chan. Approximate nearest neighbor queries revisited. Discrete Comput. Geom., 20:359–373, 1998.
[Cha01] B. Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, New York, 2001.
[Cha02] T. M. Chan. Closest-point problems simplified on the ram. In Proc. 13th ACM-SIAM Sympos. Discrete Algorithms,
pages 472–473. Society for Industrial and Applied Mathematics, 2002.
[Cha05] T. M. Chan. Low-dimensional linear programming with violations. SIAM J. Comput., pages 879–893, 2005.
[Cha06] T. M. Chan. Faster core-set constructions and data-stream algorithms in fixed dimensions. Comput. Geom. Theory
Appl., 35(1-2):20–35, 2006.
[Che86] L. P. Chew. Building Voronoi diagrams for convex polygons in linear expected time. Technical Report PCS-TR90-
147, Dept. Math. Comput. Sci., Dartmouth College, Hanover, NH, 1986.
[Che06] K. Chen. On k-median clustering in high dimensions. In Proc. 17th ACM-SIAM Sympos. Discrete Algorithms, pages
1177–1185, 2006.
[Che07] K. Chen. A constant factor approximation algorithm for k-median with outliers. manuscript, 2007.
[CK95] P. B. Callahan and S. R. Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-
neighbors and n-body potential fields. J. Assoc. Comput. Mach., 42:67–90, 1995.
[CKMN01] M. Charikar, S. Khuller, D. M. Mount, and G. Narasimhan. Algorithms for facility location problems with outliers.
In Proc. 12th ACM-SIAM Sympos. Discrete Algorithms, pages 642–651, 2001.
[CKR01] G. Calinescu, H. Karloff, and Y. Rabani. Approximation algorithms for the 0-extension problem. In Proceedings of
the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 8–16. Society for Industrial and Applied
Mathematics, 2001.
[Cla83] K. L. Clarkson. Fast algorithms for the all nearest neighbors problem. In Proc. 24th Annu. IEEE Sympos. Found.
Comput. Sci., pages 226–232, 1983.
[Cla87] K. L. Clarkson. New applications of random sampling in computational geometry. Discrete Comput. Geom., 2:195–
222, 1987.
[Cla88] K. L. Clarkson. Applications of random sampling in computational geometry, II. In Proc. 4th Annu. ACM Sympos.
Comput. Geom., pages 1–11, 1988.
[Cla93] K. L. Clarkson. Algorithms for polytope covering and approximation. In Proc. 3th Workshop Algorithms Data
Struct., volume 709 of Lect. Notes in Comp. Sci., pages 246–252. Springer-Verlag, 1993.
[Cla94] K. L. Clarkson. An algorithm for approximate closest-point queries. In Proc. 10th Annu. ACM Sympos. Comput.
Geom., pages 160–164, 1994.
[Cla95] K. L. Clarkson. Las Vegas algorithms for linear and integer programming. J. Assoc. Comput. Mach., 42:488–499,
1995.
[CLRS01] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press / McGraw-Hill,
Cambridge, Mass., 2001.
[CM96] B. Chazelle and J. Matoušek. On linear-time deterministic algorithms for optimization problems in fixed dimension.
J. Algorithms, 21:579–597, 1996.
[CMS93] K. L. Clarkson, K. Mehlhorn, and R. Seidel. Four results on randomized incremental constructions. Comput. Geom.
Theory Appl., 3(4):185–212, 1993.
[CS89] K. L. Clarkson and P. W. Shor. Applications of random sampling in computational geometry, II. Discrete Comput.
Geom., 4:387–421, 1989.
[CS00] N. Cristianini and J. Shaw-Taylor. Support Vector Machines. Cambridge Press, 2000.
[CW89] B. Chazelle and E. Welzl. Quasi-optimal range searching in spaces of finite VC-dimension. Discrete Comput. Geom.,
4:467–489, 1989.
[dBDS95] M. de Berg, K. Dobrindt, and O. Schwarzkopf. On lazy randomized incremental construction. Discrete Comput.
Geom., 14:261–286, 1995.
[dBS95] M. de Berg and O. Schwarzkopf. Cuttings and applications. Internat. J. Comput. Geom. Appl., 5:343–355, 1995.
[dBvKOS00] M. de Berg, M. van Kreveld, M. H. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and
Applications. Springer-Verlag, 2nd edition, 2000.
[Dey98] T. K. Dey. Improved bounds for planar k-sets and related problems. Discrete Comput. Geom., 19(3):373–382, 1998.
[DG03] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Rand. Struct. Alg.,
22(3):60–65, 2003.
[DK85] D. P. Dobkin and D. G. Kirkpatrick. A linear algorithm for determining the separation of convex polyhedra. J.
Algorithms, 6:381–392, 1985.
[DNIM04] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distribu-
tions. In Proc. 20th Annu. ACM Sympos. Comput. Geom., pages 253–262, 2004.
[Dud74] R. M. Dudley. Metric entropy of some classes of sets with differentiable boundaries. J. Approx. Theory, 10(3):227–
236, 1974.
[Dun99] C. A. Duncan. Balanced Aspect Ratio Trees. Ph.D. thesis, Department of Computer Science, Johns Hopkins Uni-
versity, Baltimore, Maryland, 1999.
[Dur95] R. Durrett. Probability: Theory and Examples. Duxbury Press, August 1995.
[EGS05] D. Eppstein, M. T. Goodrich, and J. Z. Sun. The skip quadtree: a simple dynamic data structure for multidimensional
data. In Proc. 21st Annu. ACM Sympos. Comput. Geom., pages 296–305. ACM, June 2005.
[EK89] O. Egecioglu and B. Kalantari. Approximating the diameter of a set of points in the Euclidean space. Inform.
Process. Lett., 32:205–211, 1989.
[Ele97] G. Elekes. On the number of sums and products. ACTA Arithmetica, pages 365–367, 1997.
[ERvK96] H. Everett, J.-M. Robert, and M. van Kreveld. An optimal algorithm for the (≤k)-levels, with applications to separa-
tion and transversal problems. Internat. J. Comput. Geom. Appl., 6:247–261, 1996.
[Fel71] W. Feller. An Introduction to Probability Theory and its Applications, volume II. John Wiley & Sons, NY, 1971.
[Fel91] W. Feller. An Introduction to Probability Theory and its Applications. John Wiley & Sons, NY, 1991.
[FG88] T. Feder and D. H. Greene. Optimal algorithms for approximate clustering. In Proc. 20th Annu. ACM Sympos.
Theory Comput., pages 434–444, 1988.
[FGK+ 00] A. Fabri, G.-J. Giezeman, L. Kettner, S. Schirra, and S. Schönherr. On the design of CGAL a computational geometry
algorithms library. Softw. – Pract. Exp., 30(11):1167–1202, 2000.
[FH05] J. Fischer and S. Har-Peled. Dynamic well-separated pair decomposition made easy. In CCCG, pages 235–238,
2005.
[FH06] J. Fischer and S. Har-Peled. On coresets for clustering and related problems. manuscript, 2006.
[For97] S. Fortune. Voronoi diagrams and Delaunay triangulations. In J. E. Goodman and J. O’Rourke, editors, Handbook
of Discrete and Computational Geometry, chapter 20. CRC Press LLC, Boca Raton, FL, 1997.
[FRT03] J. Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbitrary metrics by tree metrics. In
Proc. 35th Annu. ACM Sympos. Theory Comput., pages 448–455, 2003.
[FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.
J. Comput. Syst. Sci., 55(1):119–139, 1997.
[Gar82] I. Gargantini. An effective way to represent quadtrees. Commun. ACM, 25(12):905–910, 1982.
[Gar02] R. J. Gardner. The Brunn-Minkowski inequality. Bull. Amer. Math. Soc., 39:355–405, 2002.
[GK92] P. Gritzmann and V. Klee. Inner and outer j-radii of convex bodies in finite-dimensional normed spaces. Discrete
Comput. Geom., 7:255–280, 1992.
[GLS88] M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Optimization, volume 2 of
Algorithms and Combinatorics. Springer-Verlag, Berlin Heidelberg, 2nd edition, 1988. 2nd edition 1994.
[Gol95] M. Goldwasser. A survey of linear programming in randomized subexponential time. SIGACT News, 26(2):96–104,
1995.
[Gon85] T. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoret. Comput. Sci., 38:293–306, 1985.
[GP84] J. E. Goodman and R. Pollack. On the number of k-subsets of a set of n points in the plane. J. Combin. Theory Ser.
A, 36:101–104, 1984.
[GRSS95] M. Golin, R. Raman, C. Schwarz, and M. Smid. Simple randomized algorithms for closest pair problems. Nordic J.
Comput., 2:3–27, 1995.
[Grü03] B. Grünbaum. Convex Polytopes. Springer, 2nd edition, May 2003. Prepared by Volker Kaibel, Victor Klee, and
Günter Ziegler.
[GT00] A. Gupta and E. Tardos. A constant factor approximation algorithm for a class of classification problems. In Proc.
32nd Annu. ACM Sympos. Theory Comput., pages 652–658, 2000.
[Gup00] A. Gupta. Embeddings of Finite Metrics. PhD thesis, University of California, Berkeley, 2000.
[Har00a] S. Har-Peled. Constructing planar cuttings in theory and practice. SIAM J. Comput., 29(6):2016–2039, 2000.
[Har00b] S. Har-Peled. Taking a walk in a planar arrangement. SIAM J. Comput., 30(4):1341–1367, 2000.
[Har01a] S. Har-Peled. A practical approach for computing the diameter of a point-set. In Proc. 17th Annu. ACM Sympos.
Comput. Geom., pages 177–186, 2001.
[Har01b] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. In Proc. 42nd Annu. IEEE Sympos. Found.
Comput. Sci., pages 94–103, 2001.
[HI00] S. Har-Peled and P. Indyk. When crossings count - approximating the minimum spanning tree. In Proc. 16th Annu.
ACM Sympos. Comput. Geom., pages 166–175, 2000.
[HM03] S. Har-Peled and S. Mazumdar. Fast algorithms for computing the smallest k-enclosing disc. In Proc. 11th Annu.
European Sympos. Algorithms, volume 2832 of Lect. Notes in Comp. Sci., pages 278–288. Springer-Verlag, 2003.
[HM04] S. Har-Peled and S. Mazumdar. Coresets for k-means and k-median clustering and their applications. In Proc. 36th
Annu. ACM Sympos. Theory Comput., pages 291–300, 2004.
[HM06] S. Har-Peled and M. Mendel. Fast construction of nets in low dimensional metrics, and their applications. SIAM J.
Comput., 35(5):1148–1184, 2006.
[HS06] S. Har-Peled and M. Sharir. Relative ε-approximations in geometry. Manuscript. Available from http://www.
uiuc.edu/~sariel/papers/06/integrate, 2006.
[HÜ05] S. Har-Peled and A. Üngör. A time-optimal delaunay refinement algorithm in two dimensions. In Proc. 21st Annu.
ACM Sympos. Comput. Geom., pages 228–236, 2005.
[HW87] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete Comput. Geom., 2:127–151, 1987.
[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc.
30th Annu. ACM Sympos. Theory Comput., pages 604–613, 1998.
[Ind99] P. Indyk. Sublinear time algorithms for metric space problems. In Proc. 31st Annu. ACM Sympos. Theory Comput.,
pages 154–159, 1999.
[Ind01] P. Indyk. Algorithmic applications of low-distortion geometric embeddings. In Proc. 42nd Annu. IEEE Sympos.
Found. Comput. Sci., pages 10–31, 2001. Tutorial.
[Ind04] P. Indyk. Nearest neighbors in high-dimensional spaces. In J. E. Goodman and J. O’Rourke, editors, Handbook of
Discrete and Computational Geometry, chapter 39, pages 877–892. CRC Press LLC, Boca Raton, FL, 2nd edition,
2004.
[Joh48] F. John. Extremum problems with inequalities as subsidary conditions. Courant Anniversary, pages 187–204, 1948.
[Kal92] G. Kalai. A subexponential randomized simplex algorithm. In Proc. 24th Annu. ACM Sympos. Theory Comput.,
pages 475–482, 1992.
[KF93] I. Kamel and C. Faloutsos. On packing R-trees. In Proc. 2nd Intl. Conf. Info. Knowl. Mang., pages 490–499, 1993.
[Kle02] J. Kleinberg. An impossibility theorem for clustering. In Neural Info. Proc. Sys., 2002.
[KLMN04] R. Krauthgamer, J. R. Lee, M. Mendel, and A. Naor. Measured descent: A new embedding method for finite metric
spaces. In Proc. 45th Annu. IEEE Sympos. Found. Comput. Sci., page to appear, 2004.
[KMN+ 04] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation
algorithm for k-means clustering. Comput. Geom. Theory Appl., 28:89–112, 2004.
[KOR00] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional
spaces. SIAM J. Comput., 2(30):457–474, 2000.
[KS06] H. Kaplan and M. Sharir. Randomized incremental constructions of three-dimensional convex hulls and planar
voronoi diagrams, and approximate range counting. In Proc. 17th ACM-SIAM Sympos. Discrete Algorithms, pages
484–493, 2006.
[KT06] J. Kleinberg and E. Tardos. Algorithm design. Addison-Wesley, 2006.
[Lei84] F. T. Leighton. New lower bound techniques for VLSI. Math. Syst. Theory, 17:47–70, 1984.
[Leo98] S. J. Leon. Linear Algebra with Applications. Prentice Hall, 5th edition, 1998.
[Lit88] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Mach. Learn.,
2(4):285–318, 1988.
[LLS01] Y. Li, P. M. Long, and A. Srinivasan. Improved bounds on the sample complexity of learning. J. Comput. Syst. Sci.,
62(3):516–527, 2001.
[Mac50] A.M. Macbeath. A compactness theorem for affine equivalence-classes of convex regions. Canad. J. Math, 3:54–61,
1950.
[Mag02] A. Magen. Dimensionality reductions that preserve volumes and distance to affine spaces, and their algorithmic
applications. In The 6th Intl. Work. Rand. Appr. Tech. Comp. Sci., pages 239–253, 2002.
[Mat90] J. Matoušek. Bi-lipschitz embeddings into low-dimensional euclidean spaces. Comment. Math. Univ. Carolinae,
31:589–600, 1990.
[Mat92] J. Matoušek. Efficient partition trees. Discrete Comput. Geom., 8:315–334, 1992.
[Mat95a] J. Matoušek. On enclosing k points by a circle. Inform. Process. Lett., 53:217–221, 1995.
[Mat95b] J. Matoušek. On geometric optimization with few violated constraints. Discrete Comput. Geom., 14:365–384, 1995.
[Mat98] J. Matoušek. On constants for cuttings in the plane. Discrete Comput. Geom., 20:427–448, 1998.
[Mat99] J. Matoušek. Geometric Discrepancy. Springer, 1999.
[Mat02] J. Matoušek. Lectures on Discrete Geometry. Springer, 2002.
[Meg83] N. Megiddo. Linear-time algorithms for linear programming in R3 and related problems. SIAM J. Comput.,
12(4):759–776, 1983.
[Meg84] N. Megiddo. Linear programming in linear time when the dimension is fixed. J. Assoc. Comput. Mach., 31:114–127,
1984.
[Mil04] G. L. Miller. A time efficient Delaunay refinement algorithm. In Proc. 15th ACM-SIAM Sympos. Discrete Algorithms,
pages 400–409, 2004.
[MN98] J. Matoušek and J. Nešetřil. Invitation to Discrete Mathematics. Oxford Univ Pr, 1998.
[MP03] R. R. Mettu and C. G. Plaxton. The online median problem. SIAM J. Comput., 32(3):816–832, 2003.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, New York, NY, 1995.
[MSW96] J. Matoušek, M. Sharir, and E. Welzl. A subexponential bound for linear programming. Algorithmica, 16:498–516,
1996.
[Mul94a] K. Mulmuley. Computational Geometry: An Introduction Through Randomized Algorithms. Prentice Hall, Engle-
wood Cliffs, NJ, 1994.
[Mul94b] K. Mulmuley. An efficient algorithm for hidden surface removal, II. J. Comp. Sys. Sci., 49:427–453, 1994.
[O’R85] J. O’Rourke. Finding minimal enclosing boxes. Internat. J. Comput. Inform. Sci., 14:183–199, 1985.
[OvL81] M. H. Overmars and J. van Leeuwen. Maintenance of configurations in the plane. J. Comput. Syst. Sci., 23:166–204,
1981.
[Rab76] M. O. Rabin. Probabilistic algorithms. In J. F. Traub, editor, Algorithms and Complexity: New Directions and Recent
Results, pages 21–39. Academic Press, New York, NY, 1976.
[Rup93] J. Ruppert. A new and simple algorithm for quality 2-dimensional mesh generation. In Proc. 4th ACM-SIAM Sympos.
Discrete Algorithms, pages 83–92, 1993.
[SA95] M. Sharir and P. K. Agarwal. Davenport-Schinzel Sequences and Their Geometric Applications. Cambridge Uni-
versity Press, New York, 1995.
[Sag94] H. Sagan. Space-Filling Curves. Springer-Verlag, 1994.
[Sam89] H. Samet. Spatial Data Structures: Quadtrees, Octrees, and Other Hierarchical Methods. Addison-Wesley, Reading,
MA, 1989.
[Sei91] R. Seidel. Small-dimensional linear programming and convex hulls made easy. Discrete Comput. Geom., 6:423–434,
1991.
[Sei93] R. Seidel. Backwards analysis of randomized geometric algorithms. In J. Pach, editor, New Trends in Discrete and
Computational Geometry, volume 10 of Algorithms and Combinatorics, pages 37–68. Springer-Verlag, 1993.
[Sha03] M. Sharir. The Clarkson-Shor technique revisited and extended. Comb., Prob. & Comput., 12(2):191–201, 2003.
[Smi00] M. Smid. Closest-point problems in computational geometry. In Jörg-Rüdiger Sack and Jorge Urrutia, editors,
Handbook of Computational Geometry, pages 877–935. Elsevier Science Publishers B. V. North-Holland,
Amsterdam, 2000.
[SSS02] Y. Sabharwal, N. Sharma, and S. Sen. Improved reductions of nearest neighbors search to PLEBs with applications to
linear-sized approximate Voronoi decompositions. In Proc. 22nd Conf. Found. Soft. Tech. Theoret. Comput. Sci., pages
311–323, 2002.
[Sto91] J. Stolfi. Oriented Projective Geometry: A Framework for Geometric Computations. Academic Press, New York,
NY, 1991.
[SW92] M. Sharir and E. Welzl. A combinatorial bound for linear programming and related problems. In Proc. 9th Sympos.
Theoret. Aspects Comput. Sci., volume 577 of Lect. Notes in Comp. Sci., pages 569–579. Springer-Verlag, 1992.
[Szé97] L. A. Székely. Crossing numbers and hard Erdős problems in discrete geometry. Combinatorics, Probability and
Computing, 6:353–358, 1997.
[Tót01] G. Tóth. Point sets with many k-sets. Discrete Comput. Geom., 26(2):187–194, 2001.
[Tou83] G. T. Toussaint. Solving geometric problems with the rotating calipers. In Proc. IEEE MELECON ’83, pages
A10.02/1–4, 1983.
[Üng04] A. Üngör. Off-centers: A new type of Steiner points for computing size-optimal quality-guaranteed Delaunay
triangulations. In Latin Amer. Theo. Inf. Symp., pages 152–161, 2004.
[Vai86] P. M. Vaidya. An optimal algorithm for the all-nearest-neighbors problem. In Proc. 27th Annu. IEEE Sympos. Found.
Comput. Sci., pages 117–122, 1986.
[Van97] R. J. Vanderbei. Linear Programming: Foundations and Extensions. Kluwer, 1997.
[VC71] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their
probabilities. Theory Probab. Appl., 16:264–280, 1971.
[Wel86] E. Welzl. More on k-sets of finite sets in the plane. Discrete Comput. Geom., 1:95–100, 1986.
[Wel92] E. Welzl. On spanning trees with low crossing numbers. In Data Structures and Efficient Algorithms, Final Report
on the DFG Special Joint Initiative, volume 594 of Lect. Notes in Comp. Sci., pages 233–249. Springer-Verlag, 1992.
[WVTP97] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner. Scalable high speed IP routing lookups. In Proc. ACM
SIGCOMM '97, October 1997.
[YAPV04] H. Yu, P. K. Agarwal, R. Poreddy, and K. R. Varadarajan. Practical methods for shape fitting and kinetic data
structures using core sets. In Proc. 20th Annu. ACM Sympos. Comput. Geom., pages 263–272, 2004.
Index
edge ratio, 32
embedding, 189
excess, 18, 20
expansive mapping, 190
exponential distribution, 139
extended cluster, 32
extent, 169, 184

face, 110
facility location, 60
fair split tree, 48
feasible solution, 97
finger tree, 26

gamma distribution, 139
Gradation, 27
gradation, 17, 27
greedy permutation, 53
grid, 13
    cluster, 13
ground set, 63

heavy, 19

incidence
    line-point, 87
isoperimetric inequality, 135

killing set, 80, 88

lazy randomized incremental construction, 83
level, 24, 85, 116
line
    support, 183
linear program
    vertex, 97
Linear programming, 97
linear programming
    unbounded, 97
linearization, 171
Lipschitz, 137
local search, 54, 60
    k-median clustering, 55
lower convex chain, 185
lower envelope, 169, 184

median, 137
metric, 51, 189
metric space, 51, 189
metric spaces
    low doubling dimension, 49
Metropolis algorithm, 60
Minkowski sum, 133
moments technique, 82
    all regions, 80, 88
monotone ε-shell set, 177

nearest neighbor, 45
net, 54
normal distribution, 139

order
    Q, 29
    z, 29
outliers, 60

passes, 108
Peano curve, 37
planar, 86
point above hyperplane, 185
point below hyperplane, 185
Poisson distribution, 139
polyhedron, 97
polytope, 110
probabilistic distortion, 192
Problem
    Dominating Set, 59
    Satisfiability, 60
    Set Cover, 153, 156, 159
    Set Covering, 179
    Traveling Salesperson, 60
    uniqueness, 15
    Vertex Cover, 60

quadtree
    balanced, 32
    compressed, 25
    linear, 35

radius, 15
Random sampling
    Weighted Sets, 178
Randomized Incremental Construction, 77
range, 63
range space, 63
    projection, 63
region, 25
RIC, 77
ring tree, 119

separated
    sets, 39
Separation property, 54
separator, tree, 26
shatter function, 65
shattered, 63
shattering dimension, 65
simplex, 98, 113
simulated annealing, 60
sink, 113
skip-quadtree, 27
spanner, 43
sponginess, 50
spread, 24, 193
    squared, 50
stretch, 43
successful, 157

target function, 97
Theorem 14.1.2, 134
upper convex chain, 185
upper envelope, 169, 184

VC dimension, 63
vertex, 110
vertex figure, 111
vertical decomposition, 77
    vertex, 77
visibility polygon, 157
Voronoi
    partition, 52
Voronoi diagram, 186

weight, 85
    region, 80, 88
well-balanced, 32
well-separated pairs decomposition, 39
width, 13
WSPD, 39, 40
    generator, 46