Computational Geometry
Sariel Har-Peled
Department of Computer Science; University of Illinois; 201 N. Goodwin Avenue; Urbana, IL, 61801, USA;
[email protected]; http://www.uiuc.edu/~sariel/. Work on this paper was partially supported by an NSF
CAREER award CCR-0132901.
Contents
Contents 3
Preface 11
3 Well Separated Pairs Decomposition 39
3.1 Well-separated pairs decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1.1 The construction algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Applications of WSPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.1 Spanners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.2 Approximating the Minimum Spanning Tree . . . . . . . . . . . . . . . . . . . . . 44
3.2.3 Approximating the Diameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.4 Closest Pair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.5 All Nearest Neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.5.1 The bounded spread case . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.5.2 All nearest neighbor - the unbounded spread case . . . . . . . . . . . . . 47
3.3 Bibliographical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6 Sampling and the Moments Technique 77
6.1 Vertical Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.1.1 Backward Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.2 General Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.3.1 Analyzing the RIC Algorithm for Vertical Decomposition . . . . . . . . . . . . . . 82
6.3.2 Cuttings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.5 Proof of Lemma 6.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
10.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
15 ANN in High Dimensions 143
15.1 ANN on the Hypercube . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
15.1.1 Hypercube and Hamming distance . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
15.1.2 Constructing NNbr for the Hamming cube . . . . . . . . . . . . . . . . . . . . . . . 143
15.1.3 Constructing the near-neighbor data-structure . . . . . . . . . . . . . . . . . . . . . 144
15.2 LSH and ANN in Euclidean Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
15.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
15.2.2 Locality Sensitive Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
15.2.3 ANN in High Dimensional Euclidean Space . . . . . . . . . . . . . . . . . . . . . . 146
15.2.3.1 Low quality HST in high dimensional Euclidean space . . . . . . . . . . . 146
15.2.3.2 The overall result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
15.3 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
19.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
23 Duality 183
23.1 Duality of lines and points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
23.1.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
23.1.1.1 Segments and Wedges . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
23.1.1.2 Convex hull and upper/lower envelopes . . . . . . . . . . . . . . . . . . . 184
23.2 Higher Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
23.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
23.4 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
23.4.1 Projective geometry and duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
23.4.2 Duality, Voronoi Diagrams and Delaunay Triangulations. . . . . . . . . . . . . . . . 186
24 Finite Metric Spaces and Partitions 189
24.1 Finite Metric Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
24.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
24.2.1 Hierarchical Tree Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
24.2.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
24.3 Random Partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
24.3.1 Constructing the partition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
24.3.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
24.4 Probabilistic embedding into trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
24.4.1 Application: approximation algorithm for k-median clustering . . . . . . . . . . . . 192
24.5 Embedding any metric space into Euclidean space . . . . . . . . . . . . . . . . . . . . . . . 193
24.5.1 The bounded spread case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
24.5.2 The unbounded spread case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
24.6 Bibliographical notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
24.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Bibliography 203
Index 210
Preface
This manuscript is a collection of class notes on geometric approximation algorithms. It represents the book
I wish I could have read when I started doing research.
There are without doubt errors and mistakes in the text and I would like to know about them. Please
email me about any of them you find.
On optimality. Since this text is intended to explain key ideas and algorithms, I have consistently avoided
the trap of optimality, when the optimal known algorithm is way more complicated and less insightful
than some other algorithm. A reference to the optimal known algorithm is usually given in the
bibliographical notes.
A note on style. Injected into the text are random comments (usually as footnotes) that have nothing
directly to do with the text. I hope these comments make the text more enjoyable to read, and I added them,
on the spur of the moment, to amuse myself. Some readers might find these comments irritating and vain,
and I humbly ask these readers to ignore them.
Acknowledgements
I had the benefit of interacting with numerous people on the work in this book, sometimes directly and
sometimes indirectly. There is something mundane and predictable in enumerating a long list of people that helped and
contributed to this work, but this in no way diminishes their contribution and their help.
As such, I would like to thank the students in the class for their input, which helped in discovering
numerous typos and errors in the manuscript. Furthermore, the content was greatly affected by numerous
insightful discussions with Jeff Erickson and Edgar Ramos. Other people that provided comments or
insights, or answered nagging emails from me, and to whom I am thankful, include Bernard Chazelle, John
Fischer, Samuel Hornus, Piotr Indyk, Mira Lee, Jirka Matoušek, and Manor Mendel.
I am sure that other people have contributed to this work whom I have forgotten to mention – they have
my thanks and apologies.
Copyright
This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 License. To view a copy
of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative
Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
— Sariel Har-Peled
April 2007, Daejeon, Korea
Chapter 1
The Peace of Oliva. How sweet and peaceful it sounds! There the great powers noticed for the first time that the
land of the Poles lends itself admirably to partition.
– The Tin Drum, Günter Grass
In this chapter, we are going to discuss two basic geometric algorithms. The first one computes the
closest pair among a set of n points in linear time. This is a beautiful and surprising result that exposes the
computational power of using grids for geometric computation. Next, we discuss a simple algorithm for
approximating the smallest enclosing ball that contains k points of the input. This at first looks like a bizarre
problem, but turns out to be a key ingredient to our later discussion.
1.1 Preliminaries
For a real positive number r and a point p = (x, y) in IR², define Gr(p) to be the grid point (⌊x/r⌋·r, ⌊y/r⌋·r).
We call r the width of the grid Gr. Observe that Gr partitions the plane into square regions, which we call
grid cells. Formally, for any i, j ∈ Z, the intersection of the half-planes x ≥ ri, x < r(i + 1), y ≥ rj and
y < r(j + 1) is said to be a grid cell. Further, we define a grid cluster as a block of 3 × 3 contiguous grid cells.
Note that every grid cell C of Gr has a unique ID; indeed, let p = (x, y) be any point in C, and consider
the pair of integer numbers idC = id(p) = (⌊x/r⌋, ⌊y/r⌋). Clearly, only points inside C are going to be
mapped to idC. This is very useful, since we can store a set P of points inside a grid efficiently. Indeed, given a
point p, compute its id(p). We associate with each unique id a data-structure that stores all the points falling
into this grid cell (of course, we do not maintain such data-structures for grid cells which are empty). So,
once we computed id(p), we fetch the data structure associated with this cell, by using hashing. Namely,
we store pointers to all those data-structures in a hash table, where each such data-structure is indexed by its
unique id. Since the ids are integer numbers, we can do the hashing in constant time.
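The following is a minimal Python sketch of this grid storage scheme; a Python dictionary stands in for the hash table, and the class and method names are illustrative only.

    import math
    from collections import defaultdict

    class Grid:
        # Points of P bucketed by the id of the grid cell (of width r) containing them.
        def __init__(self, r):
            self.r = r
            self.cells = defaultdict(list)    # cell id -> list of points in that cell

        def cell_id(self, p):
            # id(p) = (floor(x/r), floor(y/r)), the unique id of the cell containing p
            return (math.floor(p[0] / self.r), math.floor(p[1] / self.r))

        def insert(self, p):
            self.cells[self.cell_id(p)].append(p)

        def cluster(self, p):
            # all stored points in the 3x3 block of cells centered at the cell of p
            i, j = self.cell_id(p)
            return [q for di in (-1, 0, 1) for dj in (-1, 0, 1)
                    for q in self.cells.get((i + di, j + dj), [])]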
Assumption 1.1.1 Throughout the discourse, we assume that every hashing operation takes (worst case)
constant time. This is quite a reasonable assumption when true randomness is available (using for example
perfect hashing [CLRS01]).
For a point set P, and parameter r, the partition of P into subsets by the grid Gr , is denoted by Gr (P).
More formally, two points p, q ∈ P belong to the same set in the partition Gr (P), if both points are being
mapped to the same grid point or equivalently belong to the same grid cell.
1.2 Closest Pair
We are interested in solving the following problem:
Problem 1.2.1 Given a set P of n points in the plane, find the pair of points closest to each other. Formally,
return the pair of points realizing CP(P) = min_{p,q∈P, p≠q} ‖p − q‖.
Lemma 1.2.2 Given a set P of n points in the plane, and a distance r, one can verify in linear time, whether
CP(P) < r, CP(P) = r, or CP(P) > r.
Proof: Indeed, store the points of P in the grid Gr . For every non-empty grid cell, we maintain a linked
list of the points inside it. Thus, adding a new point p takes constant time. Indeed, compute id(p), check if
id(p) already appears in the hash table, if not, create a new linked list for the cell with this ID number, and
store p in it. If a data-structure already exists for id(p), just add p to it.
This takes O(n) time. Now, if any grid cell in Gr(P) contains more than, say, 9 points of P, then it must
be that CP(P) < r. Indeed, consider a cell C containing more than nine points of P, and partition C into
3 × 3 equal squares. Clearly, one of those squares must contain two points of P; let C′ be this square.
Clearly, diam(C′) = diam(C)/3 = √(r² + r²)/3 < r. Thus, the two (or more) points of P in C′ are at
distance smaller than r from each other, and CP(P) < r.
Theorem 1.2.3 For set P of n points in the plane, one can compute the closest pair of P in expected linear
time.
Proof: Pick a random permutation of the points of P, and let ⟨p1, . . . , pn⟩ be this permutation. Let r2 =
‖p1 − p2‖, and start inserting the points into the data structure of Lemma 1.2.2. In the ith iteration, if ri = ri−1,
then this insertion takes constant time. If ri < ri−1, then we rebuild the grid and reinsert the points. Namely,
we recompute Gri(Pi).
To analyze the running time of this algorithm, let Xi be the indicator variable which is 1 if ri ≠ ri−1, and
0 otherwise. Clearly, the running time is proportional to
    R = 1 + Σ_{i=2}^{n} (1 + Xi · i).
Thus, the expected running time is E[R] = 1 + Σ_{i=2}^{n} (1 + i · Pr[Xi = 1]), by linearity of expectation and since, for an indicator variable Xi, we have E[Xi] = Pr[Xi = 1].
Thus, we need to bound Pr[Xi = 1] = Pr[ri < ri−1 ]. To bound this quantity, fix the points of Pi , and
randomly permute them. A point q ∈ Pi is called critical, if CP(Pi \ {q}) > CP(Pi ). If there are no critical
points, then ri−1 = ri and then Pr[Xi = 1] = 0. If there is one critical point, then Pr[Xi = 1] = 1/i, as this is
the probability that this critical point, would be the last point in the random permutation of Pi .
If there are two critical points, then let p, q be this (unique) pair of points of Pi realizing CP(Pi). The
quantity ri is smaller than ri−1 only if either p or q is pi. But the probability for that is 2/i (i.e., the
probability, in a random permutation of i objects, that one of two marked objects would be the last element
in the permutation).
Observe that there can not be more than two critical points. Indeed, if p and q are two points realizing
the closest distance, and there is a third critical point r, then CP(Pi \ {r}) = ‖pq‖, and so r is not critical,
a contradiction.
We conclude that
    E[R] = n + Σ_{i=2}^{n} i · Pr[Xi = 1] ≤ n + Σ_{i=2}^{n} i · (2/i) ≤ 3n,
and the expected running time is O(E[R]) = O(n).
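For concreteness, here is a short Python sketch of the algorithm in the proof above; the function names are ours, and the points are assumed to be distinct (so that the closest-pair distance is positive), with n ≥ 2.

    import math
    import random
    from collections import defaultdict

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def cell(p, r):
        # id of the grid cell of width r that contains p
        return (math.floor(p[0] / r), math.floor(p[1] / r))

    def closest_pair(points):
        pts = list(points)
        random.shuffle(pts)                       # the random permutation <p_1, ..., p_n>
        r = dist(pts[0], pts[1])                  # r_2 = ||p_1 - p_2||
        grid = defaultdict(list)
        for p in pts[:2]:
            grid[cell(p, r)].append(p)
        for i in range(2, len(pts)):
            p = pts[i]
            ci, cj = cell(p, r)
            # distance from p to the points inserted so far; scanning the 3x3 cluster
            # around p's cell suffices, since all other points are at distance >= r
            nearest = min((dist(p, q) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                           for q in grid.get((ci + di, cj + dj), [])), default=r)
            if nearest < r:
                # r_i < r_{i-1}: rebuild the grid with the new width and reinsert P_i
                r = nearest
                grid = defaultdict(list)
                for q in pts[:i]:
                    grid[cell(q, r)].append(q)
            grid[cell(p, r)].append(p)
        return r

Each rebuild costs time linear in the number of points inserted so far, which is exactly the Xi · i term in the analysis above.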
Theorem 1.2.3 is a surprising result, since it implies that uniqueness (i.e., deciding if n real numbers are
all distinct) can be solved in linear time. However, there is a lower bound of Ω(n log n) on uniqueness, using
the comparison model. This reality dysfunction can be easily explained once one realizes that the computation model of Theorem 1.2.3 is considerably stronger, as it uses hashing, randomization, and the floor function.
Indeed, compute the median in the x-order of the points of P, split P into two sets, and recurse on each set, till the number of
points in a subproblem is of size ≤ k/4. We have T (n) = O(n) + 2T (n/2), and the recursion stops for n ≤ k/4. Thus, the recursion
tree has depth O(log(n/k)), which implies running time O(n log(n/k)).
Consider the (non-uniform) grid G induced by the lines h1, . . . , hm and v1, . . . , vm. Let X be the set of all
intersection points of G. We claim that Dopt(P, k) contains at least one point of X. Indeed, consider the
center u of Dopt(P, k), and let c be the cell of G that contains u. Clearly, if Dopt(P, k) does not cover any of
the four vertices of c, then it can cover only points in the vertical strip of G that contains c, and only points
in the horizontal strip of G that contains c. See Figure 1.1. However, each such strip contains at most k/4
points. It follows that Dopt(P, k) contains at most k/2 points of P, a contradiction. Thus, Dopt(P, k) must
contain a point of X. For every point p ∈ X, compute the smallest circle centered at p that contains k points
of P. Clearly, for a point q ∈ X ∩ Dopt(P, k), this yields the required 2-approximation. Indeed, the disk of
radius 2·ropt(P, k) centered at q contains at least k points of P since it also covers Dopt(P, k). We summarize
as follows:

(Figure 1.1: If the disk Dopt(P, k) does not contain any vertex of the cell c, then it does not cover any
shaded area. As such, it can contain at most k/2 points, since the vertical and horizontal strips containing c
each has at most k/4 points of P inside it.)

Lemma 1.3.1 Given a set P of n points in the plane, and a parameter k, one can compute, in O(n(n/k)²)
deterministic time, a circle D that contains k points of P, and radius(D) ≤ 2·ropt(P, k).

Corollary 1.3.2 Given a set P of n points and a parameter k = Ω(n), one can compute, in linear time, a
circle D that contains k points of P and radius(D) ≤ 2·ropt(P, k).
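The following Python sketch is one way to implement this 2-approximation (the ApproxHeavy routine used later in the chapter). The rule used here to pick the lines, namely every ⌊k/4⌋-th coordinate in sorted order, is our assumption standing in for the construction of the lines h1, . . . , hm and v1, . . . , vm, which is not reproduced above; the analysis only needs every strip to contain at most k/4 points.

    import heapq
    import math

    def approx_heavy(P, k):
        # Returns a radius r with r_opt(P, k) <= r <= 2 r_opt(P, k), assuming k <= |P|.
        xs = sorted(p[0] for p in P)
        ys = sorted(p[1] for p in P)
        step = max(1, k // 4)
        vlines = xs[step::step] or [xs[len(xs) // 2]]   # x-coordinates of the vertical lines
        hlines = ys[step::step] or [ys[len(ys) // 2]]   # y-coordinates of the horizontal lines
        best = math.inf
        for x in vlines:
            for y in hlines:
                # radius of the smallest circle centered at the grid vertex (x, y)
                # containing k points of P: the k-th smallest distance to (x, y)
                r = heapq.nsmallest(k, (math.hypot(p[0] - x, p[1] - y) for p in P))[-1]
                best = min(best, r)
        return best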
Remark 1.4.1 For a point set P of n points, the radius r returned by the algorithm of Lemma 1.3.1 is the
distance between a vertex of the non-uniform grid and a point of P. As such, a grid Gr computed using this
distance is one of O(n3 ) possible grids. Indeed, a circle is defined by the distance between a vertex of the
non-uniform grid of Lemma 1.3.1, and a point of P. A vertex of such a grid is determined by two points of
P, through which its vertical and horizontal lines pass. Thus, there are O(n³) such triples.
Let gdr (P) denote the maximum number of points of P mapped to a single point by the mapping Gr .
Define depth(P, r) to be the maximum number of points of P that a circle of radius r can contain.
Lemma 1.4.2 For any point set P, and r > 0, we have: (i) for any real number A > 0, it holds that
depth(P, Ar) ≤ (A + 1)² · depth(P, r); (ii) gdr(P) ≤ depth(P, r) ≤ 9 · gdr(P); (iii) if r ≤ 2·ropt(P, k), then
gdr(P) ≤ 5k; and (iv) any disk of radius r is contained in some grid cluster of Gr.
Proof: (i) Consider the disk D of radius Ar realizing depth(P, Ar), and let D′ be the disk of radius
(A + 1)r having the same center as D. For every point p ∈ V = P ∩ D, place a disk of radius r around it,
and let S denote the resulting set of disks. Since every disk of S has area πr², and they are all contained
in D′, which is of area π(A + 1)²r², it follows that there must be a point q inside D′ which is contained in at least
    µ = ⌈ |V|·πr² / (π(A + 1)²r²) ⌉ = ⌈ depth(P, Ar) / (A + 1)² ⌉
disks. This means that the disk of radius r centered at q contains at least µ points of P. Now, µ ≤ depth(P, r).
Thus, depth(P, Ar)/(A + 1)² ≤ µ ≤ depth(P, r), as claimed.
(iv) Consider a (closed) disk D of radius r, and let c be its center. If c is in the interior of a grid cell C,
then the claim easily holds, since D can intersect only C or the cells adjacent to C. Namely, D is contained
in the cluster centered at C. The problematic case, is when c is on the grid boundaries. Since grid cells are
closed on one side, and open on the other, it is easy to verify that the claim holds again by careful and easy
case analysis, which we will skip here.
(ii) Consider the grid cell C of Gr(P) that realizes gdr(P), and let c be a point placed in the center of
C. Clearly, a disk D of radius √(r² + r²)/2 = r/√2 centered at c completely covers the cell C. Thus,
gdr(P) ≤ |D ∩ P| ≤ depth(P, r). As for the other inequality, observe that the disk D realizing depth(P, r)
can intersect at most 9 cells of the grid Gr, by (iv). Thus, depth(P, r) = |D ∩ P| ≤ 9 · gdr(P).
(iii) Let C be the grid cell of Gr realizing gdr (P). Place 4 points, at the corners of C, and one point in the
center of C. Placing a disk of radius ropt (P, k) at each of those points, completely covers C, as can be easily
verified (since the side length of C is at most 2ropt (P, k)). Thus, |P ∩ C| = gdr (P) ≤ 5 depth(P, ropt (P, k)) =
5k.
1.4.1.1 Description
As in the previous sections, we construct a grid which partitions the points into small (O(k) sized) groups.
The key idea behind speeding up the grid computation is to construct the appropriate grid over several
rounds. Specifically, we start with a small set of points as seed and construct a suitable grid for this subset.
Next, we incrementally insert the remaining points, while adjusting the grid width appropriately at each
step.
Let P = (P1 , . . . , Pm ) be a u-gradation of P (see Definition 1.4.3), where u = max(k, n/ log n). The
sequence P can be computed in expected linear time as shown in Lemma 1.4.4.
Since |P1 | = O(k), we can compute r1 , in O(|P1 | (|P1 | /k)2 ) = O(k) = O(n) time, using Lemma 1.3.1.
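Computing a gradation is straightforward; the following Python sketch assumes (since Definition 1.4.3 and Lemma 1.4.4 are not reproduced here) that a u-gradation keeps every point of Pi in Pi−1 independently with probability 1/2, and stops once at most u points remain.

    import random

    def gradation(P, u):
        # returns (P_1, ..., P_m) with P_m = P and |P_1| <= u
        levels = [list(P)]
        while len(levels[-1]) > u:
            levels.append([p for p in levels[-1] if random.random() < 0.5])
        levels.reverse()
        return levels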
Grow(Pi, ri−1, k)
Output: ri
begin
    Gi−1 ← Gri−1(Pi)
    for every grid cluster c ∈ Gi−1 with |c ∩ Pi| ≥ k do
        Pc ← c ∩ Pi
        rc ← ApproxHeavy(Pc, k)
        // ApproxHeavy is the algorithm of Lemma 1.3.1;
        // we have ropt(Pc, k) ≤ rc ≤ 2·ropt(Pc, k).
The algorithm now works in m rounds, where m is the length of the sequence P. At the end of the ith
round, we have a distance ri such that gdri (Pi ) ≤ 5k, and there exists a grid cluster in Gri containing more
than k points of Pi and ropt (Pi , k) ≤ ri .
At the ith round, we first construct a grid Gi−1 for points in Pi using ri−1 as grid width. We know that
there is no grid cell containing more than 5k points of Pi−1 . As such, intuitively, we expect every cell of
Gi−1 to contain at most 10k points of Pi , since Pi−1 ⊆ Pi was formed by choosing each point of Pi into Pi−1
with probability 1/2. (This is of course too good to be true, but something slightly weaker does hold.) This
allows us to use the slow algorithm of Lemma 1.3.1 on those grid clusters. Note that, for k = Ω(n), the
algorithm of Lemma 1.3.1 runs in linear time, and thus the overall running time is linear.
The algorithm used in the ith round is more concisely stated in Figure 1.2. At the end of the m rounds we
have rm , which is a 2-approximation to the radius of the optimal k enclosing circle of Pm = P. The overall
algorithm is summarized in Figure 1.3.
1.4.1.2 Analysis
Lemma 1.4.5 For i = 1, . . . , m, we have ropt (Pi , k) ≤ ri ≤ 2ropt (Pi , k), and the heaviest cell in Gri (Pi )
contains at most 5k points of Pi .
Proof: Consider the optimal circle Di that realizes ropt (Pi , k). Observe that there is a cluster c of Gri−1 that
contains Di , as ri−1 ≥ ri . Thus, when Grow handles the cluster c, we have Di ∩ Pi ⊆ c. The first part of the
lemma then follows from the correctness of the algorithm of Lemma 1.3.1.
As for the second part, observe that any grid cell of width ri can be covered by 5 disks of radius ri/2,
and ri/2 ≤ ropt(Pi, k). It follows that each grid cell of Gri(Pi) contains at most 5k points.
Now we proceed to upper-bound the number of cells of Gri−1 that contains “too many” points of Pi .
Since each point of Pi−1 was chosen from Pi with probability 1/2, we can express this bound as a sum of
independent random variables, and bound this using tail-bounds.
Definition 1.4.6 For a point set P, and parameters k and r, the excess of Gr (P) is
    E(P, r) = Σ_{c ∈ Cells(Gr)} ⌊ |c ∩ P| / (50k) ⌋.
LinearApprox(P, k)
Output: r - a 2-approximation to ropt (P, k)
begin
Compute a gradation {P1 , . . . , Pm } of P as in Lemma 1.4.4
r1 ← ApproxHeavy(P1 , k)
// ApproxHeavy is the algorithm of Lemma 1.3.1
// which outputs a 2-approximation
for i ← 2 to m do
    ri ← Grow(Pi, ri−1, k)
return rm
Remark 1.4.7 The quantity 100k · E(P, r) is an upper bound on the number of points of P in a heavy cell
of Gr (P), where a cell of Gr (P) is heavy if it contains at least 50k points.
Lemma 1.4.8 For any positive real t, the probability that Gri−1(Pi) has excess E(Pi, ri−1) ≥ α = t + ⌈3 lg n⌉
is at most 2^{−t}.
Proof: Let G be the set of O(n3 ) possible grids that might be considered by the algorithm (see Remark 1.4.1),
and fix a grid G ∈ G with excess M = E(Pi , κ(G)) ≥ α, where κ(G) is the sidelength of a cell of G.
Let U = { Pi ∩ c | c ∈ G, |Pi ∩ c| ≥ 50k } be the point sets of all the heavy cells of G(Pi). Furthermore, let
    V = ∪_{X∈U} P(X, 50k),
where P(X, ν) denotes an arbitrary partition of the set X into disjoint subsets such that each one of them
contains ν points, except maybe the last subset, which might contain between ν and 2ν − 1 points.
It is clear that |V| = E(Pi, κ(G)) = M. From the Chernoff inequality, for any S ∈ V, we have µ = E[|S ∩ Pi−1|] ≥
25k, and setting δ = 4/5 we have
    Pr[|S ∩ Pi−1| ≤ 5k] ≤ Pr[|S ∩ Pi−1| ≤ (1 − δ)µ] < exp(−µδ²/2) ≤ exp(−25k·(4/5)²/2) < 1/2.
Furthermore, if G = Gri−1, then each cell of G contains at most 5k points of Pi−1. Thus we have
    Pr[Gri−1 = G] ≤ Π_{S∈V} Pr[|S ∩ Pi−1| ≤ 5k] ≤ (1/2)^{|V|} = 1/2^M ≤ 1/2^α.
Since there are n³ different grids in G, we have
    Pr[E(Pi, ri−1) ≥ α] = Pr[ ∪_{G∈G: E(Pi,κ(G))≥α} (G = Gri−1) ]
                        ≤ Σ_{G∈G: E(Pi,κ(G))≥α} Pr[G = Gri−1] ≤ n³ · (1/2^α) ≤ 1/2^t.
We next bound the expected running time of the algorithm LinearApprox by bounding the expected
time spent in the ith iteration. In particular, let Y be the random variable which is the excess of Gri−1(Pi). In
this case, there are at most Y cells which are heavy in Gri−1(Pi), and each such cell contains at most O(Yk)
points. Thus, invoking the algorithm ApproxHeavy on such a heavy cell takes O(Yk · ((Yk)/k)²) = O(Y³k)
time. Overall, the running time of Grow, in the ith iteration, is T(Y) = O(|Pi| + Y · Y³k) = O(|Pi| + Y⁴k).
For technical reasons, we need to consider the light and heavy cases separately to bound Y. So set
Λ = ⌈3 lg n⌉.
The Light Case: k < Λ. We have that the expected running time is proportional to
    O(|Pi|) + Σ_{t=0}^{⌈n/k⌉} Pr[Y = t] · T(t)
        = O(|Pi|) + Pr[0 ≤ Y ≤ Λ] · T(Λ) + Σ_{t=Λ+1}^{n/k} Pr[Y = t] · T(t)
        ≤ O(|Pi|) + T(Λ) + Σ_{t=1}^{n/k} (1/2^t) · T(t + Λ)
        = O(|Pi|) + O(k log⁴ n) + Σ_{t=1}^{n/k} (t + Λ)⁴·k / 2^t
        = O(|Pi| + k log⁴ n) = O(|Pi|),
by Lemma 1.4.8 and since T (·) is a monotone increasing function.
Lemma 1.4.9 For k ≥ Λ, the probability that Gri−1(Pi) has excess larger than t is at most 2^{−t}.
Proof: We use the same technique as in Lemma 1.4.8. By the Chernoff inequality, the probability that any
subset of Pi of size 50k would contain at most 5k points of Pi−1 is less than
    exp(−25k · (16/25) · (1/2)) ≤ exp(−5k) ≤ 1/n⁴.
In particular, arguing as in Lemma 1.4.8, it follows that the probability that E(Pi, ri−1) exceeds t is smaller
than n³/n^{4t} ≤ 2^{−t}.
Thus, if k ≥ Λ, the expected running time of Grow in the ith iteration is at most
    O( Σ_{c ∈ Gri−1} |c ∩ Pi| · (|c ∩ Pi| / k)² ) = O( |Pi| + Σ_{t=1}^{∞} t · tk · (tk/k)² · (1/2^t) ) = O(|Pi| + k) = O(|Pi|),
by Lemma 1.4.9.
Overall Running Time Analysis. Thus, by the above analysis and by Lemma 1.4.4, the total expected
running time of LinearApprox inside the inner loop is O(Σ_i |Pi|) = O(n). As for the last step, of computing
a 2-approximation, consider the grid Grm (P). Each grid cell contains at most 5k points, and hence each grid
cluster contains at most 45k points. Also the smallest k enclosing circle is contained in some grid cluster. In
each cluster that contains more than k points, we use the algorithm of Corollary 1.3.2, and finally output the
minimum over all the clusters. The overall running time is O((n/k)k) = O(n) for this step, since each point
belongs to at most 9 clusters.
Theorem 1.4.10 Given a set P of n points in the plane, and a parameter k, one can compute, in expected
linear time, a radius r, such that ropt (P, k) ≤ r ≤ 2ropt (P, k).
Once we compute r such that ropt (P, k) ≤ r ≤ 2ropt (P, k), using the algorithm of Theorem 1.4.10, we
apply an exact algorithm to each cluster of the grid Gr (P) which contains more than k points.
Matoušek presented such an exact algorithm [Mat95a], and it has running time of O(n log n + nk) and
space complexity O(nk). Since r is a 2 approximation to ropt (P, k), each cluster has O(k) points. Thus the
running time of the exact algorithm in each cluster is O(k2 ) and requires O(k2 ) space. The number of clusters
which contain more than k points is O(n/k). Hence the overall running time is O(nk), and the space used is
O(n + k2 ).
Theorem 1.4.11 Given a set P of n points in the plane and a parameter k, one can compute, in expected
O(nk) time, using O(n + k2 ) space, the radius ropt (P, k), and a circle Dopt (P, k) that covers k points of P.
1.6 Exercises
Exercise 1.6.1 (Compute clustering radius.) [10 Points]
Let C and P be two given sets of points in the plane, such that k = |C| and n = |P|. Let r =
max_{p∈P} min_{c∈C} ‖c − p‖ be the covering radius of P by C (i.e., if we place a disk of radius r around each
point of C, then all those disks together cover the points of P).
(A) Give an O(n + k log n) expected time algorithm that outputs a number α, such that r ≤ α ≤ 10r.
(B) For ε > 0 a prescribed parameter, give an O(n + kε^{−2} log n) expected time algorithm that outputs a number
α, such that α ≤ r ≤ (1 + ε)α.
Chapter 2
In this chapter, we discuss quadtrees, which are arguably one of the simplest and most powerful geometric
data-structures. We begin in Section 2.1 by giving a simple application of quadtrees, and describe a clever
way for performing point-location queries quickly in such a quadtree. In Section 2.2, we describe how such
quadtrees can be compressed, how they can be quickly constructed, and how they can be used for point-location queries.
In Section 2.3 we describe a randomized extension of this data-structure, known as skip-quadtree, which
enables us to maintain the compressed quadtree efficiently under insertions and deletions. In Section 2.5, we
turn our attention to applications of compressed quadtrees, showing how quadtrees can be used to compute
good triangulations of an input point set.
2.1.1 Fast point-location in a quadtree
One possible interpretation of quadtrees is that they are a multi-grid representation of a point-set. In partic-
ular, given a node v, with a square S v , which is of depth i (the root has depth zero), then the side length of
Sv is 2^{−i}, and it is a square in the grid G_{2^{−i}}. In fact, we will refer to ℓ(v) = −i as the level of v. However,
a cell in a grid has a unique ID made out of two integer numbers. Thus, a node v of a quadtree is uniquely
defined by the triple id(v) = (ℓ(v), ⌊x/r⌋, ⌊y/r⌋), where (x, y) is any point in 2v, and r = 2^{ℓ(v)}.
Furthermore, given a query point q, and a desired level ℓ, we can compute the ID of the quadtree cell of
this level that contains q in constant time. Thus, this suggests a very natural algorithm for doing a
point-location in a quadtree: store all the IDs of the nodes in the quadtree in a hash-table, and also compute the
maximal depth h of the quadtree. Given a query point q, we now have access to any node along the
point-location path of q in T, in constant time. In particular, we want to find the point in T where the
point-location path “falls off” the quadtree. This we can find by performing a binary search for the
dropping-off point. Let QTGetNode(T, q, d) denote the procedure that, in constant time, returns the node v of
depth d in the quadtree T such that 2v contains the point q. Given a query point q, we can perform
point-location in T by calling QTFastPntLocInner(T, q, 0, height(T)). See Figure 2.1 for the pseudo-code of
QTFastPntLocInner.

QTFastPntLocInner(T, q, lo, hi)
    mid ← ⌊(lo + hi)/2⌋
    v ← QTGetNode(T, q, mid)
    if v = null then
        return QTFastPntLocInner(T, q, lo, mid − 1)
    w ← Child(v, q)    // w is the child of v containing the point q
    if w = null then
        return v
    return QTFastPntLocInner(T, q, mid + 1, hi)

Figure 2.1: One can perform point-location in a quadtree T by calling QTFastPntLocInner(T, q, 0, height(T)).
Lemma 2.1.1 Given a quadtree T of size n and of height h, one can preprocess it in linear time, such that
one can perform a point-location query in T in O(log h) time. In particular, if the quadtree has height
O(log n) (i.e., it is “balanced”), then one can perform a point-location query in T in O(log log n) time.
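A concrete Python version of this binary search follows; the hash-table of node IDs is simply a Python set here, and the function names are ours.

    import math

    def cell_id(q, level):
        # id of the canonical grid cell of side 2**level that contains q, for q in [0,1)^2
        r = 2.0 ** level
        return (level, math.floor(q[0] / r), math.floor(q[1] / r))

    def point_location(node_ids, q, height):
        # node_ids: set of ids of all quadtree nodes; height: maximal depth of the quadtree.
        # Returns the id of the lowest quadtree node whose cell contains q.
        lo, hi = 0, height
        best = cell_id(q, 0)                 # the root always contains q
        while lo <= hi:
            mid = (lo + hi) // 2
            cid = cell_id(q, -mid)           # depth mid corresponds to level -mid
            if cid in node_ids:
                best = cid
                lo = mid + 1                 # the point-location path continues below depth mid
            else:
                hi = mid - 1                 # we have already fallen off the quadtree
        return best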
    Φ(P) = ( max_{p,q∈P} ‖p − q‖ ) / ( min_{p,q∈P, p≠q} ‖p − q‖ )
be the spread of P. In words, the spread of P is the ratio between the diameter of P and the distance between
the two closest points. Intuitively, the spread tells us the range of distances that P possesses.
One can build a quadtree T for P, storing the points of P in the leaves of T, where one keeps splitting a
node as long as it contains more than one point of P. During this recursive construction, if a leaf contains
no points of P, we save space by not creating this leaf, and instead creating a null pointer in the parent node
for this child.
Lemma 2.2.2 Let P be a set of n points in the unit square, such that diam(P) = max_{p,q∈P} ‖p − q‖ ≥ 1/2. Let
T be a quadtree of P constructed over the unit square. Then, the depth of T is bounded by O(log Φ(P)), it
can be constructed in O(n log Φ(P)) time, and the total size of T is O(n log Φ(P)).
Proof: The construction is done by a straightforward recursive algorithm as described above.
Let us bound the depth of T. Consider any two points p, q ∈ P, and observe that a node v of T of level
u = ⌊lg ‖p − q‖⌋ − 1 containing p can not contain q (we remind the reader that lg n = log₂ n). Indeed, the
diameter of 2v is smaller than √2 · 2^u ≤ √2 · ‖p − q‖/2 < ‖p − q‖. Thus, 2v can not contain both p and q. In
particular, any node of T of level − lg Φ − 1 or smaller can contain at most one point of P, where Φ = Φ(P).
Thus, all the nodes of T are of depth O(log Φ).
Since the construction algorithm spends O(n) time at each level of T, it follows that the construction
time is O(n log Φ), and this also bounds the size of the quadtree T.
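The recursive construction just described can be sketched in a few lines of Python; the points are assumed to be distinct and to lie in the unit square, and representing nodes as plain dictionaries is our implementation choice, not something mandated by the text.

    def build_quadtree(points, cx=0.5, cy=0.5, half=0.5):
        # node square = [cx - half, cx + half] x [cy - half, cy + half]
        if not points:
            return None                      # empty children are not created
        node = {"center": (cx, cy), "half": half, "points": points, "children": []}
        if len(points) == 1:
            return node                      # a leaf stores a single point of P
        h = half / 2
        for dx in (-h, h):
            for dy in (-h, h):
                sub = [p for p in points
                       if (p[0] >= cx) == (dx > 0) and (p[1] >= cy) == (dy > 0)]
                child = build_quadtree(sub, cx + dx, cy + dy, h)
                if child is not None:
                    node["children"].append(child)
        return node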
The bounds of Lemma 2.2.2 are tight, as one can easily verify; see Exercise 2.7.2. But in fact, if you
inspect a quadtree generated by Lemma 2.2.2, you would realize that there are a lot of nodes of T which are
of degree one (the degree of a node is the number of children it has). Indeed, for a node v of T, let Pv be the
subset of points of P stored in the subtree of v. A node v has degree larger than one only if it splits Pv into
at least two subsets, and globally there can be only n − 1 such splitting nodes.
Thus, a quadtree T contains a lot of “useless” nodes. We can replace such
a sequence of edges by a single edge. To this end, we will store inside each
quadtree node v, its square 2v , and its level `(v). Given a path of vertices in the
quadtree that are all of degree one, we will replace them with a single vertex
that corresponds to the first vertex in this path, and its only child would be the
last vertex in this path (this is the first node of degree larger than one). This
compressed node has a single child, and the region rgv that it is in “charge” of
is an annulus, see the figure on the right. Otherwise, the region that a node is
in charge of is rgv = 2v. The child corresponds to the inner square. We call
the resulting tree a compressed quadtree. Since any node that has only a single
child is compressed, we can charge it to its parent, which has two children. Since there are at most n − 1
internal nodes in the new compressed quadtree that have degree larger than one, it follows that it has linear
size (however, it still can have linear depth in the worst case).
As an application of such a compressed quadtree, consider the problem of reporting the points of P
that are inside a query rectangle r. We can start from the root of the quadtree, and recursively traverse it, going
down into a node only if its region intersects the query rectangle. Clearly, we will report all the points contained
inside r. Of course, we have no guarantee about the query time, but in practice, this might be fast enough.
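A sketch of this traversal, over the dictionary-based nodes of the construction sketch above, looks as follows; for a compressed quadtree one would test the node's region (square or annulus) instead of its square.

    def report_in_rect(node, rect, out):
        # rect = (x0, y0, x1, y1); appends to out every stored point inside rect
        if node is None:
            return
        (cx, cy), h = node["center"], node["half"]
        # prune: recurse only if the node's square intersects the query rectangle
        if cx + h < rect[0] or rect[2] < cx - h or cy + h < rect[1] or rect[3] < cy - h:
            return
        if not node["children"]:                       # a leaf: test its point(s) directly
            for p in node["points"]:
                if rect[0] <= p[0] <= rect[2] and rect[1] <= p[1] <= rect[3]:
                    out.append(p)
            return
        for child in node["children"]:
            report_in_rect(child, rect, out)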
in the appropriate node of Tout . This takes linear time (ignoring the time to construct Tin and Tout ). Thus,
the overall construction time is O(n log n).
Theorem 2.2.3 Given a set P of n points in the plane, one can compute a compressed quadtree of P in
O(n log n) deterministic time.
Definition 2.2.4 (Canonical square and grid.) A square is a canonical square, if it is contained inside the
unit square, it is a cell in a grid Gr , and r is a power of two (i.e., it might correspond to a node in a quadtree).
We will refer to such a grid Gr , as a canonical grid.
For reasons that would become clear later, we want to construct the quadtree out of a list of quadtree
nodes that must appear in the quadtree. Namely, we get a list of canonical grid cells that must appear in the
quadtree (i.e., the level of the node, together with its grid ID).
Lemma 2.2.5 Given a list C of n canonical squares, all lying inside the unit square, one can construct a
compressed quadtree T such that for any square c ∈ C, there exists a node v ∈ T , such that 2v = c. The
construction time is O(n log n).
Proof: The construction is similar to Theorem 2.2.3. Let P be a set of n points, where pc ∈ P, if c ∈ C,
and pc is the center of c. Next, find, in linear time, a canonical square c that contains at least n/250 points
of P, and at most n/2 points of P. Let U be the list of all squares of C that contain c, let Cin be the list
of squares of C contained inside c, and let Cout be the list of squares of C that do not intersect the interior of c.
Recursively, build compressed quadtrees for Cin and Cout, denoted by Tin and Tout, respectively.
Next, sort the nodes of U in decreasing order of their level. Also, let π be the point-location path of c in
Tout . Clearly, adding all the nodes of U to Tout is no more than performing a merge of π together with the
sorted nodes of U. Whenever we encounter a square of U that does not have a corresponding node at π, we
create this node, and insert it into π. Let T′out denote the resulting tree. Next, we just hang Tin in the right
place in T′out. Clearly, the resulting quadtree has all the squares of C as nodes.
As for the running time, we have T (C) = T (Cin ) + T (Cout ) + O(n) + O(|U| log |U|) = O(n log n), since
|Cout | + |Cin | + |U| = n and |Cin | , |Cout | ≤ (249/250)n.
Definition 2.2.6 Let T be a tree with n nodes. A separator in T is a node v, such that if we remove v from
T, we remain with a forest, such that every tree in the forest has at most dn/2e vertices.
Lemma 2.2.7 Every tree has a separator, and it can be computed in linear time.
Proof: Consider T to be a rooted tree, and initialize v to be the root of T. We perform a walk on T. If v
is not a separator, then one of the children of v in T must have a subtree of T of size ≥ dn/2e nodes. Set v to
be this node. Continue in this walk, till we get stuck. The claim is that v is the required node. Indeed, since
we always go down, and the size of the subtree shrinks, we must get stuck. Thus, consider w as the node
we got stuck at. Clearly, the subtree of w contains at least dn/2e nodes (otherwise, we would not set v = w).
Also, all the subtrees of w have size ≤ dn/2e, and the connected component of T \ {w} containing the root
contains at most n − dn/2e ≤ bn/2c nodes. Thus, w is the required separator.
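The walk described in the proof translates directly into code. The following Python sketch uses dictionary-based tree nodes with a "children" list (an illustrative representation); it caches subtree sizes and then walks down as long as some child subtree is too large.

    import math

    def find_separator(root):
        def subtree_size(v):                           # compute and cache subtree sizes
            v["_size"] = 1 + sum(subtree_size(c) for c in v["children"])
            return v["_size"]
        n = subtree_size(root)
        half = math.ceil(n / 2)
        v = root
        while True:
            big = [c for c in v["children"] if c["_size"] > half]
            if not big:
                # every child subtree, and the rest of the tree, has at most ceil(n/2) nodes
                return v
            v = big[0]                                 # at most one child can be too large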
This suggests a natural way for preprocessing a compressed quadtree for point-location queries. Find a
separator v ∈ T, and create a root node fv for T′ which has a pointer to v; now recursively build finger trees
for the trees of T \ {v}, and hang them on fv. Given a query point q, we traverse T′, where at a node fv ∈ T′, we
check whether the query point q ∈ 2v, where v is the corresponding node of T. If q ∉ 2v, we continue the
search into the child of fv which corresponds to the connected component outside 2v that was hung on fv.
Otherwise, we continue into the child that contains q. This takes constant time per node. As for the depth
of the finger tree T′, observe that D(n) ≤ 1 + D(⌈n/2⌉) = O(log n). Thus, a point-location query in T′ takes
logarithmic time.
Theorem 2.2.8 Given a compressed quadtree T of size n, one can preprocess it in O(n log n) time, such that
given a query point q, one can return the lowest node in T whose region contains q in O(log n) time.
Let T1, . . . , Tm be the quadtrees of the sets P = S1, . . . , Sm, respectively. Note that the nodes of Ti are
a subset of the nodes that appear in Ti−1. As such, every node in Ti has a pointer to its own copy in Ti−1,
and a pointer to its copy in Ti+1 if it exists there. We will refer to this data-structure as a skip-quadtree.
Point-location queries. Given a query point q, we want to find the leaf of T1 that contains it. The search
algorithm is quite simple: starting at Tm, we find the leaf of Ti that contains the query point, then move
to the corresponding node in Ti−1, and continue the search from there.
2.3.2 Running time analysis
We analyze the time needed to perform a point-location query. In particular, we claim that the expected
query time is O(log n). To see this, we will use the standard backward analysis. Consider the leaf v of Ti that
contains q, and consider the path v = v1 , v2 , . . . , vr from v to the root vr of the compressed quadtree Ti . Let
π denote this path. Clearly, the amount of time spent in the search in the tree Ti , is proportional to how far
we have to go on this list, till we hit a node that appears in Ti+1 . (During the point-location we traverse this
snippet of the path in the other direction - we are in a leaf of Ti+1 , we jump into the corresponding node
of Ti , and traverse down till we reach a leaf.) Note, that the node v j stores (at least) j points of S i , and if
any pair of them appears in S i+1 then at least one of the nodes on π below the node v j would appear in Ti+1
(since the quadtree Ti+1 needs to “separate” these two points and this is done by a node of the path). Let
Uj denote this set of points, and let Xj be an indicator variable which is one if vj does not appear in Ti+1.
Clearly,
    E[Xj] = Pr[ no pair of points of Uj is in Si+1 ] ≤ (j + 1)/2^j,
since the points of Si are randomly and independently chosen to be in Si+1, and the event happens only if
zero or one points of Uj are in Si+1. Thus, the expected search time in Ti is
    E[ Σ_j Xj ] = Σ_j E[Xj] ≤ Σ_j (j + 1)/2^j = O(1).
Thus, the overall expected search time in the skip-quadtree is proportional to the number of levels in the
gradation.
Let Zi = |Si| be the number of elements stored in the ith level of the gradation. We know that Z1 = n,
and E[Zi | Zi−1] = Zi−1/2. In particular, E[Zi] = E[E[Zi | Zi−1]] = E[Zi−1/2] = · · · = n/2^{i−1}. Thus, E[Zα] ≤ 1/n^{10},
where α = ⌈11 lg n⌉. Thus, by Markov's inequality, we have that
    Pr[m > α] = Pr[Zα ≥ 1] ≤ E[Zα] / 1 ≤ 1/n^{10}.
We summarize:
Lemma 2.3.2 A gradation defined over n elements has O(log n) levels both in expectation and with high
probability.
This implies that a point-location query in the skip quadtree takes, in expectation, O(log n) time.
Since, with high probability, there are only O(log n) levels in the gradation, it follows that the expected
search time is O(log n).
High Probability. In fact, one can show that this point-location query time bound holds with high proba-
bility, and we sketch (informally) the argument why this is true. Consider the variable Yi that is the number
of nodes of Ti visited during the point-location query. We have that Pr[Yi ≥ k] ≤ Σ_{j=k}^{∞} (j + 1)/2^j =
O(k/2^k) = O(1/c^k), for some constant c > 1. Thus, we have a sum of a logarithmic number of independent
random variables, each one of them behaving like a variable with geometric distribution. As such, we can
apply a Chernoff-type inequality (see Exercise 25.4.3) to get an upper bound on the probability that the sum
of these variables exceeds O(log n). This probability is bounded by 1/n^{O(1)}.
Note, the longest query time is realized by one of the points stored in the quadtree. Since there are n
points stored in the quadtree, this implies that with high probability all point-location queries takes O(log n)
time. Also, observe that the structure of the skip-quadtree is uniquely determined by the gradation. Since
the gradation is oblivious to the history of the data-structure (i.e., which points were inserted and deleted),
these bounds on the performance hold at any point in time during the usage of the skip-quadtree.
We summarize:
Theorem 2.3.3 Let T be an empty skip-quadtree used for a sequence of n operations (i.e., insertions, dele-
tions and point-location queries). Then, with high probability (and thus also in expectation), the time to
perform each such operation takes O(log n) time.
So consider a quadtree T, and a DFS traversal of T, where the DFS always traverses the children of a
node in the same relative order (i.e., say, first the bottom-left child, then the bottom-right child, the top-left
child, and the top-right child).
Consider any two canonical squares 2 and 2̂, and imagine a quadtree T that contains both squares (i.e.,
there are nodes in T with these squares as their cells). Notice that the above DFS would always visit these
two nodes in a specific order, independent of the structure of the rest of the quadtree. Thus, if 2 gets visited
before 2̂, we denote this fact by 2 ≺ 2̂. This defines a total ordering over all canonical squares. It would in
fact be useful to extend this ordering to also include points. Thus, consider a point p and a canonical square
2. If p ∈ 2 then we will say that 2 ≺ p. Otherwise, if 2 ∈ Gi, let 2̂ be the cell in Gi that contains p. We
have that 2 ≺ p if and only if 2 ≺ 2̂. Next, consider two points p and q, and let Gi be a grid fine enough
such that p and q lie in two different cells, say, 2p and 2q, respectively. Then p ≺ q if and only if 2p ≺ 2q.
We will refer to the ordering induced by ≺ as the Q-order.
The ordering ≺, when restricted only to points, is the ordering along a space filling mapping that is
induced by the quadtree DFS. This ordering is known as the Z-order. Note, however, that since we allow
comparing cells to cells, and cells to points, the Q-order no longer has this exact interpretation. Furthermore,
unlike the Peano or Hilbert curve, our mapping is not continuous. Our mapping has the advantage of being
easy to define. Indeed, given a real number α ∈ [0, 1), with the binary expansion α = 0.x₁x₂x₃ . . . (i.e.,
α = Σ_{i=1}^{∞} xi 2^{−i}), our mapping will map it to the point (0.x₂x₄x₆ . . . , 0.x₁x₃x₅ . . .).
    ℓ = min( bit∆(xp, xq), bit∆(yp, yq) ) − 1,
where xp and yp denote the x and y coordinates of p, respectively. Thus, the side length of 2 = LCA(p, q)
is ∆ = 2^{−ℓ}. Let x′ = ⌊x/∆⌋ and y′ = ⌊y/∆⌋. Thus,
The LCA of two cells is just the LCA of their centers.
Now, given two cells 2 and 2̂, we would like to determine their Q-order. If 2 ⊆ 2̂ then 2̂ ≺ 2. If
2̂ ⊆ 2 then 2 ≺ 2̂. Otherwise, let 2̃ = LCA(2, 2̂). We can now determine which children of 2̃ contain
these two cells, and since we know the traversal ordering among the children of a node in a quadtree, we can
now resolve this query in constant time.
Corollary 2.4.1 Assuming that the bit∆ operation and the ⌊·⌋ operation can be performed in constant time,
one can compute the LCA of two points (or cells) in constant time. Similarly, the Q-order can be resolved
in constant time.
Computing bit∆ efficiently. It seems somewhat suspicious that one assumes that the bit∆ operations can
be done in constant time on a classical RAM machine. However it is a reasonable assumption on a real
world computer. Indeed, in floating point representation, once you are given a number it is easy to access its
mantissa and exponent in constant time. If the exponents are different then bit∆ can be computed in constant
time. Otherwise, we can easily xor the mantissas of both numbers, and compute the most significant bit
that is on. This can be done in constant time by converting the xored mantissa into a floating point number
and computing its log₂ (some CPUs have this operation built in). Observe that all these operations are
implemented in hardware in the CPU and require only constant time.
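For points with nonnegative integer coordinates (e.g., after snapping points of [0, 1)² to a fine integer grid), the Z-order comparison can also be written without any floating-point tricks, by locating the coordinate that holds the most significant differing bit. This is a standard alternative to the procedure described above, not the text's own method, and the convention of which coordinate is the more significant one may differ from the mapping defined earlier.

    def less_msb(a, b):
        # True if the most significant set bit of a is strictly lower than that of b
        return a < b and a < (a ^ b)

    def z_less(p, q):
        # True if the point p precedes the point q in Z-order (integer coordinates)
        msd = 0                              # coordinate of the most significant difference
        for d in range(1, len(p)):
            if less_msb(p[msd] ^ q[msd], p[d] ^ q[d]):
                msd = d
        return p[msd] < q[msd]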
Lemma 2.4.2 Given a quadtree T of size n, with its leaves stored in an ordered-set data-structure D ac-
cording to the Q-order, then one can perform point-location query in O(Q(n)) time, where Q(n) is the time
to perform a search query in D.
Lemma 2.4.3 Given two quadtrees T 0 and T 00 given as sorted lists of their nodes, one can compute the
merged quadtree in linear time (in their size) by merging the two sorted lists and removing duplicates.
If the binary search using the Q-order returns a node v such that 2v ≺ q, then if q ∈ 2v we are done,
as v is the required answer. So, if q ∉ 2v, then it must be that the node u such that q ∈ rgu is a compressed
node. As such, consider the cell 2 = LCA(2v, q). Clearly, the compressed node w whose region contains
q has the property that 2 ⊆ 2w. Furthermore, let z be the only child of w. We have that 2z ⊆ 2 ⊆ 2w.
In particular, in the ordering of the nodes by the Q-order we have 2w ≺ 2 ≺ 2z, where 2w and 2z are
consecutive in the ordering of the nodes of the compressed quadtree. It follows that we can find 2w by
doing an additional binary search for 2 in the ordered set of the nodes of the compressed quadtree. We
summarize:
Lemma 2.4.4 Given a compressed quadtree T of size n, with its leaves stored in an ordered-set data-
structure D according to the Q-order, then one can perform point-location query in T in O(Q(n)) time,
where Q(n) is the time to perform a search query in D.
• The node w is a leaf, and there is no point associated with it. Then we just store the new point q at w, and we are
done.
• The node w is a leaf, and there is a point p already stored in w. In this case, let 2 = LCA(p, q),
and insert 2 into the compressed quadtree. Furthermore, split 2 into its children, and also insert the
children into the compressed quadtree. Finally, associate p with the new leaf that contains it, and
associate q with the leaf that contains it. Note, that because of the insertion w becomes a compressed
node if 2w ≠ 2, and it becomes a regular internal node otherwise.
• The node w is a compressed node. Let z be the child of w, and consider 2 = LCA(2z, q). Insert 2
into the compressed quadtree if 2 ≠ 2w (note that in this case w would still be a compressed node, but
with a larger “hole”). Also insert all the children of 2 into the quadtree, and store q in the appropriate
child. Hang 2z from the appropriate child, and turn this child into a compressed node.
In all three cases, the insertion requires a constant number of search/insert operations on the ordered-set
data-structure.
Deletion is done in a similar fashion. We delete the point from the node that contains it, and then we
trim away nodes that are no longer necessary.
Theorem 2.4.5 Assuming one can compute the Q-order in constant time, one can maintain a compressed
quadtree of a point set in O(log n) time per operation, where insertion, deletion and point-location
queries are supported. Furthermore, this can be implemented using an ordered-set data-structure.
Lemma 2.5.1 Let φ be the smallest angle for a triangle. We have that 1/ sin φ ≤ Aratio (4) ≤ 2/ sin φ.
Proof: Consider the triangle 4 = 4abc.
(Figure: the triangle 4abc, with vertices A, B, C, opposite edges of lengths a, b, c, the height h from C onto the edge c, and the smallest angle φ at A.)
We have Aratio (4) = c/h. However, h = b sin φ, and since a is the shortest edge in the triangle (since it is
facing the smallest angle), it must be that b is the middle length edge. As such, 2b ≥ a + b ≥ c. Thus,
Aratio (4) ≥ b/h = b/(b sin φ) = 1/ sin φ. And similarly, Aratio (4) ≤ 2b/h = 2b/(b sin φ) = 2/ sin φ.
Another natural measure of sharpness is the edge ratio Eratio (4), which is the ratio between a triangle’s
longest and shortest edges. Clearly, Aratio (4) > Eratio (4), for any triangle 4. For a triangulation M, we
denote by Aratio (M) the maximum aspect ratio of a triangle in M. Similarly, Eratio (M) denotes the maximum
edge ratio of a triangle in M.
Definition 2.5.2 A corner of a quadtree cell is one of the four vertices of its square. The corners of the
quadtree are the points that are corners of its cells. We say that the side of a cell is split if either of the
neighboring boxes sharing it is split. A quadtree is balanced if any side of an unsplit cell may contain only
one quadtree corner in its interior. Namely, adjacent leaves are either of the same level, or of adjacent levels.
Lemma 2.5.3 Let P be a set of points in the plane, such that diam(P) = Ω(1) and Φ = Φ(P). Then, one can
compute a (minimal size) balanced quadtree T of P in O(n log n + m) time, where m is the size of the
output quadtree.
Proof: Compute a compressed quadtree T of P in O(n log n) time. Next, we traverse T, and replace
every compressed edge of T by the sequence of quadtree nodes that defines it. To guarantee the balance
condition, we create a queue of the nodes of T, and store the nodes of T in a hash table, with their IDs.
We handle the nodes in the queue, one by one. For a node v, we check whether the current adjacent
nodes to 2v are balanced. Specifically, let c be one of 2v ’s neighboring cells in the grid of 2v , and let c p
be the square containing c in a grid one level up. We compute id(c), id(c p ), and check if there is a node in
T with those IDs. If not, we create a node w with region c p and id(c p ), and recursively retrieve its parent
(i.e., if it exists we retrieve it, otherwise, we create it), and hang w from the parent node. We credit the work
involved in creating w to the output size. We add all the new nodes to the queue. We repeat the process till
the queue is empty.
Since the algorithm never creates nodes smaller than the smallest cell in the original compressed quadtree,
it follows that this algorithm terminates. It is also easy to argue by induction that any balanced quadtree of
P must contain all the nodes we created. Overall, the running time of the algorithm is O(n log n + m), since
the work associated with any newly created quadtree node is constant.
Definition 2.5.4 The extended cluster of a cell c in a quadtree T is the set of 5 × 5 neighboring cells of c in
the grid containing c, which are all the cells in distance < 2l from c, where l is the sidelength of c.
A quadtree T over a point set P is well-balanced, if it is balanced, and for every leaf node v that contains
a (single) point of P, we have the property that all the nodes of the extended cluster of v are leaves in T (i.e.,
none of them is split and has children), and they do not contain any other point of P. In fact, we will also
require that for every non-empty node v, all the nodes of the extended cluster of v are nodes in the quadtree.
Lemma 2.5.5 Given a point set P of n points in the plane, one can compute a well-balanced quadtree of P
in O(n log n + m) time, where m is the size of the output quadtree.
Figure 2.2: A well balanced triangulation.
Proof: We compute a balanced quadtree T of P. Next, for every leaf node v of T which contains a point
of P, we verify that all its extended cluster are leaves of T. If any other of the nodes of the extended cluster
of v contains a point of P, we split v. If any of the extended cluster nodes is missing as a leaf, we insert
it into the quadtree (with its ancestors if necessary). We repeat this process till we stop. Of course, during
this process, we keep the balanced property valid, by adding necessary nodes. Clearly, all this work can
be charged to newly created nodes, and as such takes linear time in the output size once the compressed
quadtree is computed.
A well-balanced quadtree T of P provides for every point, a region (i.e., extended cluster) where it is
well protected from other points. It is now possible to turn the partition of the plane induced by the leaves
of T into a triangulation of P.
We “warp” the quadtree framework as follows. For each input point x ∈ P, let y be the corner nearest to x of the leaf of T containing
x; we replace y by x as a corner of the quadtree. Finally, we triangulate the resulting planar subdivision.
Unwarped boxes are triangulated with isosceles right triangles by adding a point in the center. Only boxes
with unsplit sides have warped corners; for these we choose the diagonal that gives better aspect ratio.
Figure 2.2 shows a triangulation resulting from a variant of this method.
Lemma 2.5.6 The method above gives a triangulation QT (P) with Aratio (QT (P)) ≤ 4.
Proof: The right triangles used to triangulate the unwarped cells have aspect ratio 2. If a cell with side
length l is warped, we have two cases.
In the first case, the input point of P is inside the square of the original cell. Then we assume that the
diagonal touching the warped point is chosen; otherwise, the aspect ratio can only be better than what we
prove. Consider one of the two triangles formed, with corners the input point and two other cell corners.
The maximum length hypotenuse is formed when the warped point is at its original location, and has length
h = √2·l. The minimum area is formed when the point is in the center of the square, and has area a = l²/4.
Thus, the minimum height of such a triangle 4 is ≥ 2a/h, and Aratio(4) ≤ h/(2a/h) = h²/(2a) = 4.
In the second case, the input point is outside the original square. Since the quadtree is well balanced, the
new point y is somewhere inside a square of sidelength l centered at x (since we always move the closest leaf
corner to the new point). In this case, we assume that the diagonal not touching the warped point is chosen.
This divides the cell into an isosceles right triangle and another triangle. If the chosen diagonal is the longest
edge of the other triangle, then one can argue as before, and the aspect ratio is bounded by 4. Otherwise, the
longest edge touches the input point. The altitude is minimized when the triangle is isosceles with as sharp
an angle as possible; see Figure 2.3. Using the notation of Figure 2.3, we have y = (l/2, (√7/2)·l). Thus,

    µ = area(4wyz) = (1/2) · | det( 1   0     l
                                    1   l     0
                                    1   l/2   (√7/2)·l ) |  =  ((√7 − 1)/4) · l².

(Figure 2.3: the extremal configuration; the cell has corners (0, 0) and w = (l, 0), y is the warped corner, and h′ is the altitude of the triangle 4wyz at y.)

We have h′ · (√2·l)/2 = µ, and thus h′ = √2·µ/l = ((√7 − 1)/(2√2)) · l. The longest distance y can be from w is
α = √((1/2)² + (3/2)²) · l = (√10/2) · l. Thus, the aspect ratio of the new triangle is bounded by
α/h′ = (√10/2) / ((√7 − 1)/(2√2)) ≈ 2.717 ≤ 4.
For a triangulation M, let |M| denote the number of triangles of M. The Delaunay triangulation of
a point set is the triangulation formed by all triangles defined by the points such that their circumscribing
triangles are empty (the fact that this collection of triangles forms a triangulation requires a proof). Delaunay
triangulations are extremely useful, and have a lot of useful properties. We denote by DT (P) the Delaunay
triangulation of P.
Lemma 2.5.7 There is a constant c₀, independent of P, such that |QT(P)| ≤ c₀ Σ_{△∈DT(P)} log Eratio(△).
Proof: For this lemma, we modify the description of our algorithm for computing QT (P). We compute
the compressed quadtree T 00 of P, and we uncompress the edges by inserting missing cells. Next, we split
a leaf of T 00 if it has side length κ, it is not empty (i.e., it contains a point of P), and there is another point
of P of distance ≤ 2κ from it. We refer to such a node as being crowded. We repeat this, till there are no
crowded leaves. Let T 0 denote the resulting quadtree. We now iterate over all the nodes v of T 0 , and insert
all the nodes of the extended cluster of v into T 0 . Let T denote the resulting quadtree. It is easy to verify that
T is well-balanced, and identical to the quadtree generated by the algorithm of Lemma 2.5.5 (although it is
unclear how to implement the algorithm described here efficiently).
Now, all the nodes of T that were created when adding the extended cluster nodes can be charged to
nodes of T 0 . Therefore we need only count the total number of crowded cells in T 0 .
Linearly many crowded cells have more than one child with points in them. It can happen at most linearly many times that a non-empty cell c, of side length κ, has a point of P outside it within distance 2κ which, at the next level, lies in a cell non-adjacent to the children of c; indeed, such a point becomes relatively farther away as the cells shrink when they split.
If a cell b containing a point is split because an extended neighbor was split, but no extended neighbor
contains any point, then, when either b or b’s parent was split, a nearby point became farther away than 2κ.
Again, this can only happen linearly many times.
Finally a cell may contain two points, or several extended neighbor cells may contain points, and this
situation may persist when the cells split. If splitting the children of the cell or of its neighbors separates the
points, we can charge linear total work. Otherwise, let Y be a maximal set of points in the union of cell b
and its neighbors, such that splitting b, its neighbors, or the children of b and its neighbors does not further
divide Y . Then some triangle of DT (P) connects two points y1 and y2 in Y with a point z outside Y.
Each split not yet accounted for occurs between the step when Y is separated from z, and the step when
y1 and y2 become more than 2κ units apart. These steps are at most O(log Eratio(△y1y2z)) quadtree levels apart, so we can charge all the crowded cells caused by Y to △y1y2z. This triangle will not be charged by any other cells, because once we perform the splits charged to it all three points become far away from each other in the quadtree.
Therefore the number of crowded cells can be counted as a linear term, plus terms of the form O(log Eratio(△abc)) for some Delaunay triangles △abc.
Theorem 2.5.8 Given any point set P, we can find a triangulation QT(P) such that each point of P is a vertex of QT(P) and Aratio(QT(P)) ≤ 4. There is a constant c″, independent of P, such that if M is any triangulation containing the points of P as vertices, then |QT(P)| ≤ c″ |M| log Aratio(M).
In particular, any triangulation with constant aspect ratio containing P is of size Ω(|QT(P)|). Thus, up to a constant, QT(P) is an optimal triangulation.
Proof: Let Y be the set of vertices of M. Lemma 2.5.7 states that there is a constant c such that |QT(Y)| ≤ c Σ_{△∈DT(Y)} log Eratio(△). The Delaunay triangulation has the property that it maximizes the minimum angle of the triangulation, among all triangulations of the point set [For97].
If Y = P, then using this max-min-angle property, we have Aratio(M) ≥ (1/2)·Aratio(DT(P)) ≥ (1/2)·Eratio(DT(P)), by Lemma 2.5.1. Hence
|QT(P)| ≤ c Σ_{△∈DT(P)} log Eratio(△) ≤ c |M| log Eratio(DT(P)) ≤ 2c |M| log Aratio(M).
Otherwise, P ⊂ Y. Imagine running our algorithm on point set Y, and observe that |QT (P)| ≤ |QT (Y)|.
By the same argument as above, |QT (Y)| ≤ c |M| log Aratio (M).
2.6 Bibliographical notes
The idea of storing a quadtree in an ordered set by using the Q-order on the nodes (or even only on the leaves) is due to Gargantini [Gar82], and it is referred to as linear quadtrees in the literature. The idea has been used repeatedly for getting good performance in practice from quadtrees.
It may be beneficial to emphasize that if one does not require the internal nodes of the compressed quadtree for the application, then one can avoid storing them in the data-structure. In fact, if one is only interested in the points themselves, then one can even skip storing the leaves, and the compressed quadtree just becomes a data-structure that stores the points according to their Z-order. This approach can be used, for example, to construct a data-structure for approximate nearest neighbor [Cha02] (however, this data-structure is still inferior, in practice, to the more optimized but more complicated data-structure of Arya et al. [AMN+ 98]). The author finds thinking about such data-structures as compressed quadtrees (with the whole additional unnecessary information) more intuitive, but the reader might disagree.
Z-order and space filling curves. The idea of using the Z-order for speeding up spatial data-structures can be traced back to the above work of Gargantini [Gar82]; it is widely used in databases and seems to improve performance in practice [KF93]. The Z-order can be viewed as a mapping from the unit interval to the unit square, by taking the odd bits of a real number α ∈ [0, 1) to be the x-coordinate and the even bits of α to encode the y-coordinate of the mapped point. While this mapping is simple to define, it is not continuous. Somewhat surprisingly, one can find a continuous mapping that maps the unit interval onto the unit square; see Exercise 2.7.4. A large family of such mappings is known by now; see Sagan [Sag94] for an accessible book on the topic.
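To make the bit-splitting description above concrete, here is a small sketch (not part of the original text) that maps a finite-precision α ∈ [0, 1) to a point of the unit square by sending its odd-indexed bits to the x-coordinate and its even-indexed bits to the y-coordinate. The function name and the number of bits used are assumptions of the sketch.

```python
def z_order_point(alpha, bits=32):
    """Map alpha in [0,1) to (x, y) in [0,1)^2 by de-interleaving its binary
    expansion: odd-positioned bits go to x, even-positioned bits go to y."""
    assert 0.0 <= alpha < 1.0
    x = y = 0.0
    wx = wy = 0.5                 # weight of the next bit of x / of y
    for i in range(bits):
        alpha *= 2                # shift the next bit of alpha into the integer part
        bit = int(alpha)
        alpha -= bit
        if i % 2 == 0:            # 1st, 3rd, 5th, ... bit goes to the x-coordinate
            x += bit * wx
            wx /= 2
        else:                     # 2nd, 4th, 6th, ... bit goes to the y-coordinate
            y += bit * wy
            wy /= 2
    return (x, y)

# Example: alpha = 0.11011 in binary maps to x = 0.101 = 0.625, y = 0.11 = 0.75.
print(z_order_point(0.84375))
```

The inverse direction (interleaving the bits of the coordinates back into a single number) is what allows points to be sorted by their Z-order, which is the operation the text exploits.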
But is it really practical? Quadtrees seem to be widely used in practice and perform quite well. Compressed quadtrees seem to be less widely used, but they have the benefit of being much simpler than their relatives, which seem to be more practical but are theoretically equivalent.
Good triangulations. Balanced quadtrees and good triangulations are due to Bern et al. [BEG94], and our presentation closely follows theirs. The problem of generating good triangulations has received considerable attention recently, as it is central to the problem of generating good meshes, which in turn are important for efficient numerical simulations of physical processes. The main technique used in generating good triangulations is the method of Delaunay refinement. Here, one computes the Delaunay triangulation of the point set, and inserts the circumcenters of “bad” triangles as new points. Proving that this method converges and generates optimal triangulations is a non-trivial undertaking, and is due to Ruppert [Rup93]. Extending it to higher dimensions, and handling boundary conditions, make it even more challenging. However, in practice, the Delaunay refinement method outperforms the (more elegant and simpler to analyze) method of Bern et al. [BEG94], which easily extends to higher dimensions. Namely, the Delaunay refinement method generates good meshes with fewer triangles.
On the other hand, Delaunay refinement methods are slower in theory. Getting an algorithm to perform Delaunay refinement in the same time as the algorithm of Bern et al. is still open, although Miller [Mil04] gave an algorithm with only a slightly slower running time.
Very recently, Alper Üngör came up with a “Delaunay-refinement type” algorithm which outputs better meshes than the classical Delaunay refinement algorithm [Üng04]. Furthermore, by merging the quadtree approach with Üngör's technique, one can get an algorithm with optimal running time [HÜ05].
The author reserves the right to disagree with himself on this topic in the future if the need arises.
2.7 Exercises
Exercise 2.7.1 (Quadtree for fat triangulations.) [5 Points]
A triangle △ is called α-fat if each one of its angles is at least α, where α > 0 is a prespecified constant (for example, α is 5 degrees). Let P be a triangular planar map of the unit square (i.e., each face is a triangle), where all the triangles are fat, and the total number of triangles is n. Prove that the complexity of the quadtree constructed for P is O(n).
(A) [2 Points] Prove that the mapping σ covers all the points in the open square [0, 1)², and that it is one to one.
Acknowledgments
The author wishes to thank John Fischer for his detailed comments on the manuscript.
Chapter 3
As a concrete example, consider the three points p, q, and r depicted in Figure 3.1. We would like to have a representation that captures that p has similar distance to q and r, and furthermore, that q and r are close together as far as p is concerned. As such, if we are interested in the closest pair among the three points, we will only check the distance between q and r, since they are the only pair (among the three) that might realize the closest pair.
Denote by A ⊗ B = { {x, y} | x ∈ A, y ∈ B } all the (unordered) pairs of points formed by the sets A and B. A pair of sets of points Q and R is (1/ε)-separated if
max(diam(Q), diam(R)) ≤ ε · d(Q, R),
where d(Q, R) = min_{q∈Q, r∈R} ‖q − r‖. Intuitively, the pair Q ⊗ R is (1/ε)-separated if all the points of Q have roughly the same distance to the points of R. Alternatively, imagine covering the two point sets with two balls of minimum size; we require that the distance between the two balls is at least 2/ε times the radius of the larger of the two.
Thus, for the three points of Figure 3.1, the pairs {p} ⊗ {q, r} and {q} ⊗ {r} are (say) 2-separated and describe all the distances among these three points. (The gain here is quite marginal, as we replaced the distance description, made out of three pairs of points, by the distances between two pairs of sets. But stay tuned – exciting things are about to unfold.)
Motivated by the above example, a well-separated pair decomposition is a way to describe a metric by
such “well separated” pairs of sets.
[Figure 3.2: (i) A point set P = {a, b, c, d, e, f}. (ii) Its decomposition into pairs: A1 = {d}, B1 = {e}; A2 = {a, b, c}, B2 = {e}; A3 = {a, b, c}, B3 = {d}; A4 = {a}, B4 = {b, c}; A5 = {b}, B5 = {c}; A6 = {a}, B6 = {f}; A7 = {b}, B7 = {f}; A8 = {c}, B8 = {f}; A9 = {d}, B9 = {f}; A10 = {e}, B10 = {f}. (iii) Its respective (1/2)-WSPD W = { {A1, B1}, . . . , {A10, B10} }. For example, the pair of points b and e (and their distance) is represented by {A2, B2}, as b ∈ A2 and e ∈ B2. (iv) The quadtree T representing the point set P. (v) The WSPD as defined by pairs of vertices of T.]
Definition 3.1.1 (WSPD) A well-separated pair decomposition (WSPD) with parameter 1/ε of P is a set of pairs
W = { {A1, B1}, . . . , {As, Bs} },
such that
(A) Ai, Bi ⊂ P for every i,
(B) Ai ∩ Bi = ∅ for every i,
(C) the sets Ai and Bi are (1/ε)-separated for every i, and
(D) ⋃_{i=1}^{s} Ai ⊗ Bi = P ⊗ P, where the sets Ai ⊗ Bi are pairwise disjoint.
Translation: For any pair of points p, q ∈ P, there is exactly one pair {Ai, Bi} ∈ W such that p ∈ Ai and q ∈ Bi.
For a concrete example of a WSPD, see Figure 3.2.
Instead of maintaining such a decomposition explicitly, it is convenient to construct a tree T having the points of P as leaves, so that every pair {Ai, Bi} is just a pair of nodes (vi, ui) of T with Ai = Pvi and Bi = Pui, where Pv denotes the points of P stored in the subtree of a node v of T. Naturally, in our case, the tree we would use is a compressed quadtree of P, but any tree that decomposes the points such that the diameter of a point set stored in a node drops quickly as we go down the tree might work.
This WSPD representation using a tree gives us a compact representation of the distances of the point
set.
Corollary 3.1.2 For an ε⁻¹-WSPD W it holds, for any pair {u, v} ∈ W, that
∀q ∈ Pu, r ∈ Pv:  max(diam(Pu), diam(Pv)) ≤ ε ‖q − r‖.
It would usually be convenient to associate with each set Pu in the WSPD, an arbitrary representative
point repu ∈ P. Selecting and assigning these representative points can always be done by a simple DFS
traversal of the T used to represent the WSPD.
We compute the compressed quadtree T of P in O(n log n) time. Next, we compute the WSPD by calling
AlgWSPD(u0 , u0 ), where u0 is the root of T and AlgWSPD is depicted in Figure 3.3.
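Figure 3.3 is not reproduced in this text. The following Python sketch captures the recursion that the surrounding proofs describe: emit a pair once its two cells are well separated, and otherwise refine the node with the larger cell. The node attributes `diam` (the cell diameter ∆(v), taken to be 0 for a leaf containing a single point) and `children`, and the cell-distance function `dist(u, v)`, are assumed interfaces, not definitions from the text.

```python
def alg_wspd(u, v, eps, dist, pairs):
    """Sketch of the WSPD recursion on a (compressed) quadtree.
    u.diam / v.diam: cell diameter (assumed 0 for a single-point leaf).
    dist(u, v): distance between the two cells (assumed helper)."""
    if u.diam < v.diam:
        u, v = v, u                       # keep the node with the larger cell first
    if u is v and not u.children:
        return                            # a single leaf: nothing left to separate
    if u is not v and max(u.diam, v.diam) <= (eps / 8.0) * dist(u, v):
        pairs.append((u, v))              # P_u and P_v are (1/eps)-separated
        return
    for child in u.children:              # always refine the larger of the two cells
        alg_wspd(child, v, eps, dist, pairs)

# Usage sketch:  pairs = []; alg_wspd(root, root, eps, dist, pairs)
```

Note that this naive sketch may generate a well-separated pair twice (once in each order); treating the output pairs as unordered and removing duplicates only changes the count by a constant factor.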
The following lemma is implied by an easy packing argument.
Lemma 3.1.3 Let □ be a cell of a grid G of IR^d with cell diameter x. For y ≥ x, the number of cells in G at distance at most y from □ is O((y/x)^d).®
Lemma 3.1.4 The WSPD generated by AlgWSPD is valid. Namely, for any pair {u, v} in the WSPD, we have
max(diam(Pu), diam(Pv)) ≤ ε · d(u, v)   and   d(u, v) ≤ ‖q − r‖,
for any q ∈ Pu and r ∈ Pv.
®
The O(·) notation here (and the rest of the chapter) hides a constant that depends on d.
Proof: For every output pair {u, v}, we have
max{diam(Pu), diam(Pv)} ≤ max{∆(u), ∆(v)} ≤ (ε/8)·d(u, v) ≤ ε · d(u, v).
Also, for any q ∈ Pu and r ∈ Pv, we have d(u, v) ≤ ‖q − r‖, since Pu ⊆ □u and Pv ⊆ □v.
Finally, by induction, it follows that every pair of points of P is covered by a pair of subsets {Pu, Pv} output by the AlgWSPD algorithm. Note that AlgWSPD always stops if both u and v are leaves, which implies that AlgWSPD always terminates.
Proof: We trivially have that ∆(u) < ∆(p(u)) and ∆(v) < ∆(p(v)).
The pair {u, v} was generated because of a sequence of recursive calls AlgWSPD(u0, u0), AlgWSPD(u1, v1), . . ., AlgWSPD(us, vs), where us = u, vs = v, and u0 is the root of T. Assume that us−1 = u and vs−1 = p(v). Then ∆(u) ≤ ∆(p(v)), since the algorithm always refines the larger cell (i.e., vs−1 = p(v) in the pair {us−1, vs−1}).
Similarly, let t be the last index such that ut−1 = p(u) (namely, ut = u and vt−1 = vt). Then, since v is a descendant of vt−1, it holds that
∆(v) ≤ ∆(vt) = ∆(vt−1) ≤ ∆(ut−1) = ∆(p(u)),
by Lemma 3.1.5.
We charge the pair {u, v} to the node v′, and claim that each node of T is charged at most O(ε⁻ᵈ) times. To this end, fix a node v′ ∈ V(T), where V(T) is the set of vertices of T. Since the pair {u, v′} was not output by AlgWSPD (despite being considered), we conclude that 8∆(v′) > ε · d(u, v′), and as such d(u, v′) < r = 8∆(v′)/ε. Now, there are several possibilities:
(i) ∆(v′) = ∆(u). There are at most O((r/∆(v′))^d) = O(1/ε^d) nodes that have the same level (i.e., diameter) as v′ and whose cells are within distance r from it, by Lemma 3.1.3. Thus, this type of charge can happen at most O(2^d · (1/ε^d)) times, since v′ has at most 2^d children.
(ii) ∆(p(u)) = ∆(v′). By the same argumentation as above, d(p(u), v′) ≤ d(u, v′) < r. There are at most O(1/ε^d) such nodes p(u). Since the node p(u) has at most 2^d children, it follows that the number of such charges is at most O(2^d · 2^d · (1/ε^d)).
(iii) ∆(p(u)) > ∆(v′) > ∆(u). Consider the canonical grid G having □v′ as one of its cells (see Definition 2.2.4). Let □̂ be the cell of G containing □u. Observe that □u ⊊ □̂ ⊆ □p(u). In addition, d(□̂, □v′) ≤ d(□u, □v′) = d(u, v′) < r. It follows that there are at most O(1/ε^d) cells like □̂ that might participate in charging v′, and as such the total number of charges is O(2^d/ε^d), as claimed.
As such, v′ can be charged at most O(2^{2d}/ε^d) = O(1/ε^d) times. This implies that the total number of pairs generated by the algorithm is O(nε⁻ᵈ), since the number of nodes in T is O(n).
Since the running time of AlgWSPD is clearly linear in the output size, we have the following result.
Theorem 3.2.1 Given an n-point set P ⊆ IR^d and a parameter 1 ≥ ε > 0, one can compute a (1 + ε)-spanner of P with O(nε⁻ᵈ) edges, in O(n log n + nε⁻ᵈ) time.
Proof: Let c ≥ 16 be an arbitrary constant, and set δ = ε/c. Compute a δ⁻¹-WSPD decomposition using the algorithm of Theorem 3.1.7. For any vertex u in the quadtree T (used in computing the WSPD), let repu be an arbitrary point of Pu. For every pair {u, v} ∈ W, add an edge between repu and repv with weight ‖repu − repv‖, and let G be the resulting graph. Observe that, by the triangle inequality, we have dG(q, r) ≥ ‖q − r‖ for any q, r ∈ P.
The upper bound on the stretch is proved by induction on the length of pairs in the WSPD. So, fix a pair x, y ∈ P, and assume, by the induction hypothesis, that for any pair z, w ∈ P such that ‖z − w‖ < ‖x − y‖ it holds that dG(z, w) ≤ (1 + ε) ‖z − w‖.
The pair x, y must appear in some pair {u, v} ∈ W, where x ∈ Pu and y ∈ Pv. Thus
‖repu − repv‖ ≤ d(u, v) + ∆(u) + ∆(v) ≤ (1 + 2δ) ‖x − y‖
and
max( ‖repu − x‖, ‖repv − y‖ ) ≤ max(∆(u), ∆(v)) ≤ δ · d(u, v) ≤ δ ‖repu − repv‖ ≤ δ(1 + 2δ) ‖x − y‖ < (1/4) ‖x − y‖,
by Theorem 3.1.7 and since δ ≤ 1/16. As such, we can apply the induction hypothesis to the pairs repu, x and repv, y, implying that
dG(x, repu) ≤ (1 + ε) ‖repu − x‖   and   dG(repv, y) ≤ (1 + ε) ‖y − repv‖.
Now, since repu repv is an edge of G, it holds that dG(repu, repv) ≤ ‖repu − repv‖. Thus, by the inductive hypothesis and the triangle inequality, we have that
dG(x, y) ≤ dG(x, repu) + dG(repu, repv) + dG(repv, y) ≤ (1 + ε)‖repu − x‖ + ‖repu − repv‖ + (1 + ε)‖repv − y‖ ≤ 2(1 + ε)δ(1 + 2δ)‖x − y‖ + (1 + 2δ)‖x − y‖ ≤ (1 + ε)‖x − y‖.
The last step follows by an easy calculation. Indeed, since cδ = ε ≤ 1 and 16δ ≤ 1 and c ≥ 11, we have that
2(1 + ε)δ(1 + 2δ) + (1 + 2δ) ≤ 4·(9/8)·δ + 1 + 2δ ≤ 1 + 7δ ≤ 1 + cδ = 1 + ε,
as required.
Lemma 3.2.2 Given a set P of n points in IR^d, one can compute a spanning tree T of P such that w(T) ≤ (1 + ε)·w(M), where M is the minimum spanning tree of P and w(T) is the total weight of the edges of T. This takes O(n log n + nε⁻ᵈ) time.
In fact, for any r ≥ 0 and a connected component C of M≤r, the set C is contained in a connected component of T≤(1+ε)r.
Proof: Compute a (1 + ε)-spanner G of P, and let T be the minimum spanning tree of G. Clearly, T is the required (1 + ε)-approximate MST. Indeed, for any q, r ∈ P, let π_qr denote the shortest path between q and r in G. Since G is a (1 + ε)-spanner, we have that w(π_qr) ≤ (1 + ε) ‖q − r‖, where w(π_qr) denotes the weight of π_qr in G.
We have that G′ = (P, E) is a connected subgraph of G, where
E = ⋃_{(q,r)∈M} π_qr,
since G is a (1 + ε)-spanner. It thus follows that w(M(G)) ≤ w(G′) ≤ (1 + ε)·w(M(P)), where M(G) is the minimum spanning tree of G.
The second claim follows by similar argumentation.
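The reduction above is short enough to state as code. The sketch below (not from the text) runs Kruskal's algorithm over the spanner edges; the helper `spanner_edges(points, eps)`, assumed to return weighted edges such as the rep-rep edges of the WSPD construction, is a hypothetical name introduced here.

```python
def approx_mst(points, eps, spanner_edges):
    """Kruskal's algorithm on a (1+eps)-spanner of the points.
    spanner_edges(points, eps) is assumed to return triples (w, i, j),
    where w is the length of the spanner edge between points i and j."""
    parent = list(range(len(points)))

    def find(i):                              # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree = []
    for w, i, j in sorted(spanner_edges(points, eps)):
        ri, rj = find(i), find(j)
        if ri != rj:                          # edge connects two components
            parent[ri] = rj
            tree.append((w, i, j))
    return tree                               # edges of a (1+eps)-approximate MST
```

Since the spanner has O(nε⁻ᵈ) edges, the sorting step dominates, matching the stated running time up to the spanner construction.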
3.2.3 Approximating the Diameter
Lemma 3.2.3 Given a set P of n points in IR^d, one can compute, in O(n log n + nε⁻ᵈ) time, a pair u, v ∈ P such that ‖u − v‖ ≥ (1 − ε) diam(P).
Proof: Compute a (4/ε)-WSPD of P. As before, we assign to each node u of T an arbitrary representative point repu ∈ Pu. Then, for every pair in the WSPD, compute the distance between its two representative points, and return the pair of representatives farthest away from each other.
To see why this works, consider the pair q, r ∈ P realizing the diameter of P, and let {u, v} ∈ W be the pair in the WSPD that contains the two points (i.e., q ∈ Pu and r ∈ Pv). We have that
‖repu − repv‖ ≥ d(u, v) ≥ ‖q − r‖ − diam(Pu) − diam(Pv) ≥ (1 − 2(ε/2)) ‖q − r‖ = (1 − ε) diam(P),
since, by Corollary 3.1.2, max(diam(Pu), diam(Pv)) ≤ 2(ε/4) ‖q − r‖. Namely, the distance between the two points output by the algorithm is at least (1 − ε) diam(P).
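A minimal sketch of this procedure, assuming the WSPD is already available as a list of representative-point pairs (one pair of coordinate tuples per WSPD pair); the function name is introduced here for illustration only.

```python
from math import dist  # Euclidean distance between coordinate tuples (Python 3.8+)

def approx_diameter(rep_pairs):
    """Given one (rep_u, rep_v) pair of points per WSPD pair, return the
    farthest such pair; for a (4/eps)-WSPD its distance is at least
    (1 - eps) * diam(P), by the argument above."""
    return max(rep_pairs, key=lambda pr: dist(pr[0], pr[1]))
```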
Analysis. Consider the pair of closest points p and q in P, and consider the pair {u, v} ∈ W such that p ∈ Pu and q ∈ Pv. If Pu contains an additional point r ∈ Pu, then we have that
‖p − r‖ ≤ diam(Pu) ≤ ε · d(u, v) ≤ ε ‖p − q‖ < ‖p − q‖,
by Theorem 3.1.7 and since ε = 1/2. Thus, ‖p − r‖ < ‖p − q‖, a contradiction to the choice of p and q as the closest pair. Thus, |Pu| = |Pv| = 1, repu = p, and repv = q. This implies that the algorithm indeed returns the closest pair.
Theorem 3.2.4 Given a set P of n points in IRd , one can compute the closest pair of points of P in O(n log n)
time.
We remind the reader that we already saw a linear (expected) time algorithm for this problem in Sec-
tion 1.2. However, this is a deterministic algorithm, and it can be applied in more abstract settings where a
small WSPD still exists, while the previous algorithm would not work.
3.2.5 All Nearest Neighbors
Given a set P of n points in IR^d, we would like to compute for each point q ∈ P its nearest neighbor in P (formally, this is the closest point in P \ {q} to q). This is harder than it might seem at first, since this is not a symmetric relationship. Indeed, q might be the nearest neighbor to p, but r might be the nearest neighbor to q.
3.2.5.1 The bounded spread case
Assume P is contained in the unit square, and diam(P) ≥ 1/4. Furthermore, let Φ = Φ(P) denote the spread of P. Compute an ε⁻¹-WSPD W of P, for ε = 1/4. Arguing as in the closest pair case, we have that if the nearest neighbor of p is q, then there exists a pair {u, v} ∈ W such that Pu = {p} and q ∈ Pv. Thus, scan all the pairs {u, v} with a singleton as one of their sides (i.e., |Pu| = 1), and for each such singleton Pu = {r}, record for r the closest point to it in Pv. Maintain for each point the closest point to it that was encountered.
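The scan just described is straightforward; the following sketch (not from the text) assumes each WSPD pair is given as the two point sets (Pu, Pv), with points represented as coordinate tuples. Materializing the sets like this is wasteful in general, but it keeps the illustration short.

```python
from math import dist

def all_nearest_neighbors(pairs):
    """For every WSPD pair with a singleton side {r}, scan the other side and
    keep the closest point seen so far for r.
    pairs: iterable of (P_u, P_v) tuples, each a list of coordinate tuples."""
    best = {}                                    # point -> (distance, neighbor)
    for P_u, P_v in pairs:
        for singleton, other in ((P_u, P_v), (P_v, P_u)):
            if len(singleton) != 1:
                continue
            r = singleton[0]
            q = min(other, key=lambda p: dist(r, p))   # closest point of the other side
            d = dist(r, q)
            if r not in best or d < best[r][0]:
                best[r] = (d, q)
            break                                # at most one side can be the singleton we handle here
    return best
```

The analysis below bounds how many times each point is scanned in this process, which is where the log Φ factor comes from.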
Analysis. The analysis of this algorithm is slightly tedious, but it reveals some additional interesting prop-
erties of WSPD.
A pair of nodes {x, y} of T is a generator of a pair {u, v} if {u, v} was computed inside a recursive call
AlgWSPD(x, y).
Lemma 3.2.5 Let W be an ε⁻¹-WSPD of a point set P generated by AlgWSPD. Consider a pair {u, v} ∈ W. Then ∆(p(v)) ≥ (ε/2)·d(u, v) and ∆(p(u)) ≥ (ε/2)·d(u, v).
Proof: Assume, for the sake of contradiction, that ∆(v′) < (ε/2)·ℓ, where ℓ = d(u, v) and v′ = p(v). By Lemma 3.1.5, we have that
∆(u) ≤ ∆(v′) < ε·ℓ/2.
But then
d(u, v′) ≥ ℓ − ∆(v′) ≥ ℓ − ε·ℓ/2 ≥ ℓ/2.
Thus,
max( ∆(u), ∆(v′) ) < ε·ℓ/2 ≤ ε·d(u, v′).
Namely, u and v′ are well-separated, and as such {u, v′} cannot be a generator of {u, v}. Indeed, if {u, v′} was considered by the algorithm then it would have added it to the WSPD, and never created {u, v}.
So, the other possibility is that {u′, v} is the generator of {u, v}, where u′ = p(u). But then ∆(u′) ≤ ∆(v′) < ε·ℓ/2, by Lemma 3.1.5. Using the same argumentation as above, we have that {u′, v} is a well-separated pair, and as such it cannot be a generator of {u, v}.
But this implies that {u, v} cannot be generated by AlgWSPD, since either {u, v′} or {u′, v} must be a generator of {u, v}. A contradiction.
Claim 3.2.6 For two pairs {u, v}, {u′, v′} ∈ W such that □u ⊆ □u′, it holds that the interiors of □v and □v′ are disjoint.
Proof: If u′ is an ancestor of u and v′ is an ancestor of v, then AlgWSPD returned the pair {u′, v′} and would never have generated the pair {u, v}.
If u′ is an ancestor of u and v is an ancestor of v′, then
∆(u) < ∆(u′) ≤ ∆(p(v′)) ≤ ∆(v) ≤ ∆(p(u)) ≤ ∆(u′),
by Lemma 3.1.5 applied to {u′, v′} and {u, v}. Namely, ∆(v) = ∆(u′). But then the pair {u′, v} is a generator of both {u, v} and {u′, v′}. But it is impossible that AlgWSPD generated both pairs when processing {u′, v}, as can be easily verified.
A similar analysis applies for the case that u = u′.
Lemma 3.2.7 Let P be a set of n points in IR^d, let W be an ε⁻¹-WSPD of P, and let ℓ > 0 be a distance. Let Ŵ ⊆ W be the set of pairs {u, v} ∈ W such that ℓ ≤ d(u, v) ≤ 2ℓ. Then, for any point p ∈ P, the number of pairs in Ŵ containing p is O(1/ε^d).
Proof: Let u be the leaf of the quadtree T (that is used in computing W) storing the point p, and let π be the path between u and the root of T. We claim that Ŵ contains at most O(1/ε^d) pairs with nodes that appear along π. Let
T = { v | u ∈ π, {u, v} ∈ Ŵ }.
The cells of T are interior disjoint by Claim 3.2.6, and they account for all the pairs in Ŵ that cover p.
So, let r be the largest power of two which is smaller than (say) εℓ/(4√d). Clearly, there are O(1/ε^d) cells of Gr within distance 2ℓ of □u. We account for the nodes v ∈ T as follows:
(i) If ∆(v) ≥ r√d then □v contains a cell of Gr, and there are at most O(1/ε^d) such cells.
(ii) If ∆(v) < r√d and ∆(p(v)) ≥ r√d, then:
(a) If p(v) is a compressed node, then □p(v) contains a cell of Gr and it has only v as a single child. As such, there are at most O(1/ε^d) such charges.
(b) Otherwise, p(v) is not compressed, but then diam(□v) = diam(□p(v))/2. As such, □v contains a cell of Gr/2 within distance 2ℓ of □u, and there are O(1/ε^d) such cells.
(iii) The case ∆(p(v)) < r√d is impossible. Indeed, by Lemma 3.1.5, we would have ∆(p(v)) < r√d ≤ εℓ/4 ≤ (ε/4)·d(u, v), a contradiction to Lemma 3.2.5.
Lemma 3.2.8 Let P be a set of n points in the plane. One can solve the all nearest neighbor problem for P in O(n(log n + log Φ(P))) time, where Φ(P) is the spread of P.
Proof: The algorithm is described above; we are left with the task of analyzing its running time. For i ∈ {0, −1, . . . , −lg Φ − 4}, consider the set of pairs Wi such that {u, v} ∈ Wi if and only if {u, v} ∈ W and 2^{i−1} ≤ d(u, v) ≤ 2^i. A point p ∈ P can be scanned at most O(1/ε^d) = O(1) times because of pairs in Wi, by Lemma 3.2.7. As such, a point gets scanned at most O(log Φ) times overall, which implies the running time bound.
To handle the unbounded case, we need to use some additional geometric properties.
Lemma 3.2.9 Let u be a node in the compressed QT of P, and partition the space around repu into cones of
angle ≤ π/12. Let ψ be such a cone, and let Q be the set of all points in P which are in distance ≥ 4 diam(Pu )
from repu , and they all lie inside ψ. Let q be the closest point in Q to repu . Then, q is the only point in Q that
its nearest neighbor might be in Pu .
It is now straightforward (but tedious) to show that, in fact, for any p ∈ Pu, we have ‖r − p‖ ≥ ‖r − q‖, which implies the claim.
Lemma 3.2.9 implies that we can do a top-down traversal of the compressed quadtree of P, after computing an ε⁻¹-WSPD W of P, for ε = 1/16. For every node u, we maintain a (constant size) set Ru of candidate points whose nearest neighbor might lie in Pu.
So, assume we have already computed Rp(u), and consider the set
X(u) = Rp(u) ∪ ⋃_{{u,v}∈W, |Pv|=1} Pv.
(Note that we do not have to consider pairs with |Pv| > 1, since no point in Pv can have its nearest neighbor in Pu in such a scenario.) Clearly, we can compute X(u) in time linear in the number of pairs of W involving u. Now, we build a “grid” of cones around repu, and throw the points of X(u) into this grid. For each such cone, we keep only the closest point to repu. Let Ru be the set of these closest points. Since the number of cones is O(1), it follows that |Ru| = O(1).
Now, if Pu contains only a single point p, then we compute, for every point q ∈ Ru, its distance to p, and if p is a better candidate to be the nearest neighbor of q, then we set p as the (current) nearest neighbor of q.
Clearly, the resulting running time (ignoring the computation of the WSPD) is linear in the number of pairs of the WSPD and the size of the compressed quadtree. Correctness follows since, if p is the nearest neighbor of q, then there must be a WSPD pair {u, v} such that Pv = {q} and p ∈ Pu. But then the algorithm adds q to the set Ru, and q remains in Rz for all descendants z of u in the quadtree such that p ∈ Pz. In particular, if y is the leaf of the quadtree storing p, then q ∈ Ry, which implies that the algorithm correctly computes the nearest neighbor of q.
Theorem 3.2.10 Given a set P of n points in IRd , one can solve the all nearest neighbor problem in
O(n log n) time.
3.3 Bibliographical notes
Diameter. The algorithm of Section 3.2.3 for approximating the diameter can be improved by not constructing pairs that cannot improve the (current) diameter, and by constructing the underlying tree on the fly together with the diameter. This yields a simple algorithm that works quite well in practice; see [Har01a].
All nearest neighbors. Section 3.2.5 is a simplification of the solution to the all k-nearest neighbors problem. Here, one can compute for every point its k nearest neighbors in O(n log n + nk) time. See [CK95] for details.
The all nearest neighbor algorithm for the bounded spread case (Section 3.2.5.1) is from [HM06]. Note that, unlike the unbounded case, this algorithm only uses packing arguments for its correctness. Surprisingly, the usage of the Euclidean nature of the underlying space (as done in Section 3.2.5.2) seems to be crucial for getting a faster algorithm for this problem. In particular, for the case of metric spaces of low doubling dimension (that do have a small WSPD), solving this problem requires Ω(n²) time in the worst case.
Dynamic maintenance. A WSPD can be maintained in polylogarithmic time under insertions and deletions. This is quite surprising when one considers that, in the worst case, a point might participate in a linear number of pairs; in fact, a node in the quadtree might participate in a linear number of pairs. This is described in detail in Callahan's thesis [Cal95]. Interestingly, using randomization, maintaining the WSPD can be considerably simplified; see the work by Fischer and Har-Peled [FH05].
High dimension. In high dimensions, as the uniform metric demonstrates (i.e., n points, all of them at distance 1 from each other), the WSPD can have quadratic complexity. This metric is easily realizable as the vertices of a simplex in IR^{n−1}. On the other hand, doubling metrics have near linear size WSPDs. Since WSPDs by themselves are so powerful, it is tempting to try to define the dimension of a point set by the size of the WSPD it possesses. This seems like an interesting direction for future research, as currently little is known about it (to the best of my knowledge).
3.4 Exercises
Exercise 3.4.1 (WSPD Structure.) [5 Points]
(A) Let ε > 0 be a sufficiently small constant. For any sufficiently large n, show an example of a point set P of n points, such that its (1/ε)-WSPD (as computed by AlgWSPD) has the property that a single set participates in Ω(n) pairs.®
(B) Show, that if we list explicitly the sets forming the WSPD (even if we show each set exactly once) then
the total size of such a description is quadratic. (Namely, the implicit representation we use is crucial
to achieve efficient representation.)
Let P be a set of n points in IR^d. The sponginess¯ of P is the quantity X = Σ_{{p,q}⊆P} ‖p − q‖. Provide an efficient algorithm for approximating X. Namely, given P and a parameter ε > 0, it outputs a number Y such that X ≤ Y ≤ (1 + ε)X.
(The interested reader can also verify that computing (exactly) the sum of all squared distances (i.e., Σ_{{p,q}⊆P} ‖p − q‖²) is considerably easier.)
¯
Also known as the sum of pairwise distances in the literature, for reasons that I can not fathom.
Chapter 4
Do not read this story; turn the page quickly. The story may upset you. Anyhow, you probably know it already. It
is a very disturbing story. Everyone knows it. The glory and the crime of Commander Suzdal have been told in a
thousand different ways. Don’t let yourself realize that the story is the truth.
It isn’t. not at all. There’s not a bit of truth to it. There is no such planet as Arachosia, no such people as klopts, no
such world as Catland. These are all just imaginary, they didn’t happen, forget about it, go away and read something
else.
– The Crime and Glory of Commander Suzdal, Cordwainer Smith
In this chapter, we will initiate our discussion of clustering. Clustering is one of the most fundamental computational tasks but, frustratingly, one of the fuzziest. It can be stated informally as: “Given data, find interesting structure in the data. Go!”
The fuzziness arises naturally from the requirement that the structure be “interesting”, as this is not well defined and depends on human perception, which is sometimes impossible to quantify clearly. Similarly, what is “structure” is also open to debate. Nevertheless, clustering is inherent to many computational tasks like learning, searching, and data-mining.
Empirical study of clustering concentrates on trying various measures for the clustering, and trying out various algorithms and heuristics to compute these clusterings. See the bibliographical notes for some relevant references.
Here, we will concentrate on some well defined clustering tasks, including k-center clustering, k-median clustering, and k-means clustering, and some basic algorithms for these problems.
4.1 Preliminaries
A clustering problem is usually defined by a set of items, and a distance function defined between these
items. While these items might be points in IRd and the distance function is just the regular Euclidean
distance, it is sometime beneficial to consider the more abstract setting of a general metric space.
Definition 4.1.1 A metric space is a pair (X, d) where X is a set and d : X × X → [0, ∞) is a metric,
satisfying the following axioms: (i) d(x, y) = 0 if and only if x = y, (ii) d(x, y) = d(y, x), and (iii) d(x, y) +
d(y, z) ≥ d(x, z) (triangle inequality).
For example, IR2 with the regular Euclidean distance is a metric space. In the following, we assume that
we are given a black-box access to d. Namely, given two points p, q ∈ X, we assume that d(p, q) can be
computed in constant time.
The input is a set of n points P ⊆ X. Given a set of centers C, every point of P is assigned to its nearest neighbor in C. All the points of P that are assigned to a center c form the cluster of c, denoted by
Π(C, c) = { p ∈ P | d(p, c) ≤ d(p, C) },
where d(p, C) = min_{c′∈C} d(p, c′) denotes the distance of p to the set C. Namely, the center set C partitions P into clusters. This specific scheme of partitioning points by assigning them to their closest center (in a given set of centers) is known as a Voronoi partition. The k-center clustering price of clustering P by C is
ν∞^C(P) = max_{p∈P} d(p, C);
note that every point in a cluster is in distance at most ν∞^C(P) from its respective center.
Formally, the k-center problem is to find a set C of k points such that ν∞^C(P) is minimized; namely,
ν∞^opt(P, k) = min_{C, |C|=k} ν∞^C(P).
We will denote the set of centers realizing the optimal clustering by Copt. A more explicit (and somewhat more confusing) definition of the k-center clustering problem is to compute the set C of size k realizing min_C max_p min_{c∈C} d(p, c).
It is known that k-center clustering is NP-Hard, and it is in fact hard to approximate within a factor of 1.86 even in two dimensions. Surprisingly, there is a simple and elegant algorithm that achieves a 2-approximation.
In the ith iteration, the algorithm GreedyKCenter computes the distance ri = max_{p∈P} d(p, Ci−1) from the current set of centers Ci−1 to the point of P farthest from it, and the bottleneck point ci that realizes it. Next, we add ci to Ci−1 to form the new set Ci. We repeat this process k times.
To make this algorithm slightly faster, observe that
di[p] = d(p, Ci) = min( d(p, Ci−1), d(p, ci) ) = min( di−1[p], d(p, ci) ).
In particular, if we maintain for each point p a single variable d[p] with its current distance to the closest center in the current center set, then the above formula boils down to
d[p] ← min( d[p], d(p, ci) ).
Namely, the above algorithm can be implemented using O(n) space, where n = |P|. The ith iteration of
choosing the ith center takes O(n) time. Thus, overall this approximation algorithm takes O(nk) time.
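A sketch of this greedy procedure, maintaining the single distance array described above; the black-box metric access d(·, ·) is passed in as a function, and the helper names are introduced here for illustration.

```python
def greedy_k_center(points, k, d):
    """Greedy k-center: start with an arbitrary point, then repeatedly add the
    point farthest from the current center set.  Uses O(n) extra space and
    O(nk) metric evaluations; yields a 2-approximation (Theorem 4.2.1)."""
    centers = [points[0]]                          # arbitrary first center
    dist_to_centers = [d(p, centers[0]) for p in points]
    for _ in range(1, k):
        # the bottleneck point: farthest from the current set of centers
        idx = max(range(len(points)), key=lambda i: dist_to_centers[i])
        c = points[idx]
        centers.append(c)
        # d[p] <- min(d[p], d(p, c_i)), as in the update rule above
        for i, p in enumerate(points):
            dist_to_centers[i] = min(dist_to_centers[i], d(p, c))
    return centers
```

Running it with k = n produces the full greedy permutation discussed below, together with the radii r_i read off from the distance array just before each center is added.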
A ball of radius r around a point p ∈ P is the set of points of P within distance at most r from p; namely, b(p, r) = { q ∈ P | d(p, q) ≤ r }. Thus, the k-center problem can be interpreted as the problem of covering the points of P using k balls, where we wish to minimize the radius of the largest ball used.
Theorem 4.2.1 Given a set of n points P ⊆ X, belonging to a metric space (X, d), the algorithm GreedyKCenter computes a set K of k centers, such that K is a 2-approximation to the optimal k-center clustering of P; namely, ν∞^K(P) ≤ 2ν∞^opt, where ν∞^opt = ν∞^opt(P, k). The algorithm takes O(nk) time.
Proof: The running time follows by the above description, so we concern ourselves only with the ap-
proximation quality.
By definition, we have rk = ν∞^K(P); let ck+1 be the point of P realizing rk, and let C = K ∪ {ck+1}. Observe that, by the definition of ri, we have r1 ≥ r2 ≥ . . . ≥ rk. Furthermore, for i < j ≤ k + 1 we have that
d(ci, cj) ≥ d(cj, Cj−1) = rj ≥ rk.
Namely, the distance between the closest pair of points in C is at least rk. Now, assume for the sake of contradiction that rk > 2ν∞^opt(P, k). Consider the optimal solution that covers P with k balls of radius ν∞^opt. By the triangle inequality, any two points inside such a ball are at distance at most 2ν∞^opt from each other. Thus, none of these balls can cover two points of C ⊆ P, since the minimum distance in C is > 2ν∞^opt. A contradiction, since we would then need k + 1 balls of radius ν∞^opt to cover P.
In the spirit of never trusting a claim that has only a single proof, we provide an alternative proof.°
Alternative Proof: If every cluster of Copt contains exactly one point of K then the claim follows. Indeed, consider a point p ∈ P, and let c be the center it belongs to in Copt. Also, let k be the center of K that is in Π(Copt, c). We have that d(p, c) = d(p, Copt) ≤ ν∞^opt = ν∞^opt(P, k). Similarly, observe that d(k, c) = d(k, Copt) ≤ ν∞^opt. As such, by the triangle inequality, we have that d(p, k) ≤ d(p, c) + d(c, k) ≤ 2ν∞^opt.
By the pigeonhole principle, the only other possibility is that there are two centers k and u of K that are both in Π(Copt, c), for some c ∈ Copt. Assume, without loss of generality, that u was added later to the center set K by the algorithm GreedyKCenter, say in the ith iteration. But then, since GreedyKCenter always chooses the point farthest away from the current set of centers, we have that k ∈ Ci−1 and
ν∞^K(P) ≤ ν∞^{Ci−1}(P) = d(u, Ci−1) ≤ d(u, k) ≤ d(u, c) + d(c, k) ≤ 2ν∞^opt.
Definition 4.2.2 A set S ⊆ P is an r-net for P if the following two properties hold:
(i) Covering property: all the points of P are within distance at most r from the points of S.
(ii) Separation property: for any pair of points p, q ∈ S, we have that d(p, q) ≥ r.
(One can relax the separation property by requiring that the points of S be at distance Ω(r) apart.)
Intuitively, an r-net of a point set P is a compact representation of P at resolution r. Surprisingly, the greedy permutation of P provides us with such a representation for all resolutions.
Theorem 4.2.3 Let P be a set of n points in a finite metric space, and let its greedy permutation be ⟨c1, c2, . . . , cn⟩ with the associated sequence of radii ⟨r1, r2, . . . , rn⟩. For any i, we have that Ci = {c1, . . . , ci} is an ri-net of P.
Proof: Note that, by construction, rk = d(ck, Ck−1), for all k = 1, . . . , n. As such, for j < k ≤ i ≤ n, we have that d(cj, ck) ≥ rk ≥ ri, which implies the required separation property. The covering property follows by definition; see Eq. (4.1).
We will denote the set of centers realizing the optimal clustering by Copt .
There is a simple and elegant constant factor approximation algorithm for k-median clustering using
local search (its analysis however is painful).
A note on notation. Consider the set U of all k-tuples of points of P. Let pi denote the ith point of P, for i = 1, . . . , n, where n = |P|. For C ∈ U, consider the n-dimensional point
4.3.1 Local Search
We are given a set P of n points and a parameter k. In the following, let
A 2n-approximation. Observation 4.3.1 implies that if we compute a set of centers C using Theorem 4.2.1, then we have that
ν1(C)/(2n) ≤ ν∞^C(P)/2 ≤ ν∞^opt(P, k) ≤ ν1^opt ≤ ν1(C)   ⇒   ν1(C) ≤ 2n·ν1^opt. (4.2)
Improving it. Let 0 < τ < 1 be a parameter to be determined shortly. The local search algorithm AlgLocalSearchKMed initially sets the current set of centers Ccurr to be C. Next, at each iteration, it checks whether the current solution Ccurr can be improved by replacing one of its centers by a center from the outside (we will refer to such an operation as a swap). There are at most |P| · |Ccurr| = nk swaps to consider, as we pick a center c ∈ Ccurr to throw away and a new center o ∈ P \ Ccurr to replace it. We consider the new candidate set of centers K ← (Ccurr \ {c}) ∪ {o}. If ν1(K) ≤ (1 − τ)·ν1(Ccurr), then the algorithm sets Ccurr ← K. The algorithm continues iterating in this fashion over all possible swaps.
The algorithm AlgLocalSearchKMed stops when there is no swap that would improve the current solution by a factor of (at least) (1 − τ). The final content of the set Ccurr is the required constant factor approximation. Note that the running time of the algorithm is
O( (nk)² log_{1/(1−τ)}( ν1(C)/ν1^opt ) ) = O( (nk)² log_{1+τ}(2n) ) = O( (nk)² (log n)/ln(1 + τ) ) = O( (nk)² (log n)/τ ),
by Eq. (4.2) and since 1/(1 − τ) ≥ 1 + τ. The final step follows since 1 + τ ≤ exp(τ) ≤ 1 + 2τ, for τ < 1/2
as can be easily verified. Thus, if τ is polynomially small, then the running time would be polynomial.
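A sketch of the swap loop just described; the cost function ν1 and the metric d are evaluated naively here, so this matches the O((nk)² · (log n)/τ) bound above only up to the per-swap cost. The function and helper names are introduced here for illustration and are not from the text.

```python
def local_search_k_median(points, k, d, tau, initial_centers):
    """AlgLocalSearchKMed sketch: try all nk single swaps, accept a swap only
    if it improves the k-median price by a factor of at least (1 - tau)."""
    def cost(centers):
        # nu_1(C) = sum over points of the distance to the nearest center
        return sum(min(d(p, c) for c in centers) for p in points)

    centers = list(initial_centers)            # e.g., the greedy k-center solution
    curr = cost(centers)
    improved = True
    while improved:
        improved = False
        for i, c in enumerate(centers):
            for o in points:
                if o in centers:
                    continue
                cand = centers[:i] + [o] + centers[i + 1:]
                new_cost = cost(cand)
                if new_cost <= (1 - tau) * curr:   # "significant" improvement
                    centers, curr = cand, new_cost
                    improved = True
                    break
            if improved:
                break
    return centers
```

Choosing tau = ε/(2k), as in the analysis below, makes the number of accepted swaps polynomial while keeping the final solution a (5 + ε)-approximation.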
Clearly, if p is served by the same center in C and in C − c + o, then δp = 0. In particular, we have that δp ≤ 0 for all p ∈ P \ Π(C, c). Thus,
∀c ∈ C, o ∈ Copt:   0 ≤ ∆(c, o) ≤ Σ_{p ∈ Π(C,c) ∪ Π(Copt,o)} δp = Σ_{p ∈ Π(Copt,o)} δp + Σ_{p ∈ Π(C,c) \ Π(Copt,o)} δp. (4.4)
What Eq. (4.4) gives us is a large family of inequalities, all of which hold together. Each inequality is represented by a swap c → o. We would like to pick a set of swaps such that these inequalities, when added together, imply that 5ν1(Copt) ≥ ν1(C); namely, that the local search algorithm provides a constant factor approximation to the optimal clustering. This idea seems somewhat mysterious, but to see that there is indeed hope to achieve it, observe that if our set of swaps T has each center of Copt appearing in it exactly once, then when we add up the first term on the right side of Eq. (4.4), we have
Σ_{c→o∈T} Σ_{p∈Π(Copt,o)} δp = Σ_{c→o∈T} Σ_{p∈Π(Copt,o)} ( d(p, C − c + o) − d(p, C) )
  ≤ Σ_{c→o∈T} Σ_{p∈Π(Copt,o)} ( d(p, Copt) − d(p, C) ) = Σ_{p∈P} ( d(p, Copt) − d(p, C) )
  = ν1^opt − ν1(C), (4.5)
since o ranges over the elements of Copt exactly once, and for o ∈ Copt and p ∈ Π(Copt, o) we have that d(p, C − c + o) ≤ d(p, o) = d(p, Copt). Thus, we can bound the first term of the right side of Eq. (4.4) as required. We thus turn our attention to bounding the second term.
Some intuition about this term: it is the total change in contribution of the points p ∈ Π(C, c) \ Π(Copt, o). These are points that lose their current (beloved) center c. Such a point might be reassigned to the new center o, and then its price might not change by much. However, the fact that p ∉ Π(Copt, o) is (intuitively) a sign that p and o might be very far away, and p would have to look far afield for a new center in C − c + o to serve it; namely, this might be too expensive overall. To minimize it, intuitively, we would like to make the overall size of the sets Π(C, c) \ Π(Copt, o) as small as possible (over our set of swaps T). Intuitively, these sets are minimized when Π(C, c) is as similar to Π(Copt, o) as possible.
Thus, we will say that c ∈ C dominates o ∈ Copt if and only if
|Π(C, c) ∩ Π(Copt, o)| > |Π(Copt, o)| / 2.
Clearly, since the clusters of C form a partition of P, a center o ∈ Copt can be dominated by at most one center of C.
In addition, there might be centers in C that dominate more than one center of Copt, and we will refer to such centers as dictators. Nobody wants to deal with dictators (including us; except for arm dealers, art dealers and superpowers, of course), and as such we will not have any dictators involved in swaps of T. The other kind are centers of C that dominate no center of Copt; we will refer to such centers as drifters. Now, assume that there are overall D dictators in C dominating S centers of Copt. Let F be the number of centers of Copt which are not dominated by anybody (“free” centers). Then the total number of drifters is exactly
F + S − D,
as can be easily verified. Note that S ≥ 2D, since every dictator has at least two “slaves”. As such, the number of drifters is at least F + S − D ≥ F + S − (S/2) = F + S/2.
The set of swaps T. The set of swaps T we consider is constructed as follows. If c ∈ C dominates exactly one center o ∈ Copt, then the swap c → o is included in T. In addition, we add swaps between drifters of C and centers of Copt that are not yet swapped in by swaps of T. The end result is that all the centers of Copt are covered by the swaps of T. We do this in such a way that every drifter participates in at most two swaps of T. Note that the resulting set of swaps T might swap out a center of C at most twice, and it swaps in each center of Copt exactly once. To see why this is possible, observe that we have at least F + S/2 drifters that we use to cover the F + S “free” and “slave” centers of Copt. Clearly, this can be done by using each drifter at most twice in swaps of T.®
Proof: For any point p ∈ P, the points π(p) and p lie in the same cluster of the optimal clustering. As such, by the triangle inequality, we have d(p, π(p)) ≤ d(p, Copt) + d(π(p), Copt). As such, since π is a permutation, we have that
Σ_{p∈P} d(p, π(p)) ≤ Σ_{p∈P} ( d(p, Copt) + d(π(p), Copt) ) = 2ν1^opt.
Lemma 4.3.4 For any c → o ∈ T and any o′ ∈ Copt such that o ≠ o′, we have that c does not dominate o′.
Proof: If c is a drifter then the claim trivially holds. Otherwise, since c cannot be a dictator, it must be that it dominates only o, which implies that it does not dominate o′.
Lemma 4.3.5 We have that Σ_{c→o∈T} Σ_{p∈Π(C,c)\Π(Copt,o)} δp ≤ 4ν1^opt.
® Which comes to teach us that even drifters can be useful sometimes.
Proof: For a swap c → o ∈ T, consider a point p ∈ Π(C, c) \ Π(Copt, o). Let o′ be the optimal center whose cluster contains p. By Lemma 4.3.4, we know that c does not dominate o′, and as such, by Lemma 4.3.2, we have π(p) ∉ Π(C, c). As such, the center serving π(p) is not removed by the swap, and we have that
δp = d(p, C − c + o) − d(p, C) ≤ d(p, π(p)) + d(π(p), C) − d(p, C) = νp.
Note that, by the triangle inequality, νp is always non-negative. As such, we have that
γ = Σ_{c→o∈T} Σ_{p∈Π(C,c)\Π(Copt,o)} δp ≤ Σ_{c→o∈T} Σ_{p∈Π(C,c)\Π(Copt,o)} νp ≤ 2 Σ_{p∈P} νp,
since each center c ∈ C gets swapped out at most twice by T, and as such a point might contribute twice to the summation. But then
γ ≤ 2 Σ_{p∈P} νp = 2 Σ_{p∈P} ( d(p, π(p)) + d(π(p), C) − d(p, C) ) = 2 Σ_{p∈P} d(p, π(p)) ≤ 4ν1^opt.
Removing the strict improvement assumption. In the above proof, we assumed that the current local minimum cannot be improved by a swap. Of course, this might not hold for the solution returned by the algorithm, since the algorithm performs a swap only if it makes “significant” progress. In particular, Eq. (4.3) is in fact
∀c ∈ C, o ∈ P \ C:   −τ·ν1(C) ≤ ν1(C − c + o) − ν1(C). (4.6)
To adapt the proof to use these modified inequalities, observe that the proof worked by adding up k inequalities defined by Eq. (4.3) and showing that the right side is bounded by 5ν1(Copt) − ν1(C). Repeating the same argumentation on the modified inequalities yields
−τk·ν1(C) ≤ 5ν1(Copt) − ν1(C).
This implies ν1(C) ≤ 5ν1^opt/(1 − τk). For arbitrary 0 < ε < 1, setting τ = ε/(2k) we have that ν1(C) ≤ 5(1 + ε)ν1^opt, since 1/(1 − τk) ≤ 1 + 2τk = 1 + ε for τ ≤ 1/(2k). We summarize:
Theorem 4.3.7 Let P be a set of n points in a metric space. For 0 < ε < 1, one can compute a (5 + ε)-approximation to the optimal k-median clustering of P. The running time of the algorithm is O(n²k³ ε⁻¹ log n).
4.4 On k-means clustering
In the k-means clustering problem, a set P ⊆ X is provided together with a parameter k. We would like to find a set of k points C ⊆ P, such that the sum of squared distances of all the points of P to their closest point in C is minimized.
Formally, given a set of centers C, the k-means clustering price of clustering P by C is denoted by
ν2^C(P) = Σ_{p∈P} ( d(p, C) )²,
and the k-means problem is to find a set C of k points such that ν2^C(P) is minimized; namely,
ν2^opt(P, k) = min_{C, |C|=k} ν2^C(P).
Theorem 4.4.1 Let P be a set of n points in a metric space. For 0 < ε < 1, one can compute a (25 + ε)-approximation to the optimal k-means clustering of P. The running time of the algorithm is O(n²k³ ε⁻¹ log n).
4.5 Bibliographical notes
k-center clustering. The algorithm GreedyKCenter was described by Gonzalez [Gon85], but it was probably known before, as the notion of r-net is much older. The hardness of approximating k-center clustering was shown by Feder and Greene [FG88].
k-median/means clustering. The local search algorithm is due to Arya et al. [AGK+ 01]. The extension to k-means is due to Kanungo et al. [KMN+ 04]. The extension is not completely trivial, since the triangle inequality no longer holds; however, some approximate version of the triangle inequality does hold. Instead of performing a single swap, one can decide to do p swaps simultaneously. The running time then deteriorates, since there are more possibilities to check, but this improves the approximation constant for k-median (resp., k-means) to (3 + 2/p) (resp., (3 + 2/p)²). Unfortunately, this is (essentially) tight in the worst case. See [AGK+ 01, KMN+ 04] for details.
The k-median and k-means clustering problems are more interesting in the Euclidean setting, where there is considerably more structure, and one can compute a (1 + ε)-approximation in polynomial time for fixed ε. We will return to this topic later.
Since Dominating Set can easily be reduced to k-median and k-means clustering, both clustering problems are NP-Hard to solve exactly.
One can compute a similar permutation to the greedy permutation (for k-center clustering) also for k-
median clustering. See the work by Mettu and Plaxton [MP03].
Handling outliers. The problem of handling outliers is still not well understood. See the work of Charikar et al. [CKMN01] for some relevant results. In particular, for k-center clustering they get a constant factor approximation, and Exercise 4.6.2 is taken from there. For k-median clustering they present a constant factor approximation using linear programming relaxation that also approximates the number of outliers. Recently, Chen [Che07] provided a constant factor approximation algorithm by extending the work of Charikar et al. The problem of finding a simple algorithm with a simple analysis for k-median clustering with outliers is still open, as Chen's work is quite involved.
Open Problem 4.5.1 Get a simple constant factor k-median clustering algorithm that runs in polynomial time and uses exactly m outliers. Alternatively, solve this problem in the case where P is a set of n points in the plane. (The emphasis here is that the analysis of the algorithm should be simple.)
Bi-criteria approximation. All clustering algorithms tend to become considerably easier if one allows a trade-off in the number of clusters. In particular, one can compute a constant factor approximation to the optimal k-median/means clustering using O(k) centers in O(nk) time. The algorithm succeeds with constant probability. See the work by Indyk [Ind99] and Chen [Che06], and references therein.
Facility location. All the problems mentioned here fall into the family of facility location problems, of which there are numerous variants. The facility location problem proper is a variant of k-median clustering where the number of clusters is not specified; instead, one has to pay to open a facility at a given location. Local search also works for this variant.
Local search. As mentioned above, local search also works for k-means clustering [AGK+ 01]. A collection of some basic problems for which local search works is described in the book by Kleinberg and Tardos [KT06]. Local search is a widely used heuristic for attacking NP-Hard problems. The idea is usually to start from a solution and try to locally improve it. Here, one defines a neighborhood of the current solution, and one tries to move to the best solution in this neighborhood. In this sense, local search can be thought of as a hill-climbing/EM (expectation maximization) algorithm. Problems for which local search has been used include Vertex Cover, Traveling Salesperson, and Satisfiability, and probably many more. Provable cases where local search generates a guaranteed solution are less common and include facility location, k-median clustering [AGK+ 01], k-means clustering [KMN+ 04], weighted max cut, the metric labeling problem with the truncated linear metric [GT00], and image segmentation [BVZ01]. See [KT06] for more references and a nice discussion of the connection of local search to the Metropolis algorithm and simulated annealing.
4.6 Exercises
Exercise 4.6.1 (Handling outliers.) [10 Points]
Given a point set P, we would like to perform k-median clustering of it, where we are allowed to ignore m of the points. These m points are outliers which we would like to ignore since they represent irrelevant data. Unfortunately, we do not know the m outliers in advance. It is natural to conjecture that one can perform a local search for the optimal solution: one maintains a set of k centers and a set of m outliers, and at every point in time the algorithm moves one of the centers or one of the outliers if doing so improves the solution.
Show that local search does not work for this problem; namely, the approximation factor is not a constant.
Chapter 5
“I’ve never touched the hard stuff, only smoked grass a few times with the boys to be polite, and that’s all, though
ten is the age when the big guys come around teaching you all sorts to things. But happiness doesn’t mean much to
me, I still think life is better. Happiness is a mean son of a bitch and needs to be put in his place. Him and me aren’t
on the same team, and I’m cutting him dead. I’ve never gone in for politics, because somebody always stand to gain
by it, but happiness is an even crummier racket, and their ought to be laws to put it out of business.”
– – Momo, Emile Ajar
In this chapter we will try to quantify the notion of geometric complexity. It is intuitively clear that a disk is a simpler shape than an ellipse, which is in turn simpler than a smiley face. This becomes even more important when we consider several such shapes and how they interact with each other. As these examples might demonstrate, this notion of complexity is somewhat elusive.
Next, we show that one can capture the structure of a distribution/point set by a small subset. The size
here would depend on the complexity of the shapes/ranges we care about, but it would be independent of
the size of the point set.
5.1 VC Dimension
Definition 5.1.1 A range space S is a pair (X, R), where X is a (finite or infinite) ground set and R is a (finite or infinite) family of subsets of X. The elements of X are points and the elements of R are ranges. For A ⊆ X, let R|A = { r ∩ A | r ∈ R } denote the projection of R on A.
If R|A contains all subsets of A (i.e., if A is finite, we have |R|A| = 2^|A|), then A is shattered by R.
The Vapnik-Chervonenkis dimension (or VC dimension) of S , denoted by dimVC (S ), is the maximum
cardinality of a shattered subset of X. If there are arbitrarily large shattered subsets then dimVC (S ) = ∞.
5.1.1 Examples
Intervals. Consider X to be the real line, and let R be the set of all intervals on the real line. Clearly, for a set of two points on the real line, say A = {1, 2}, one can find 4 intervals that contain all possible subsets of A. However, this is false for a set of three points B = {p, q, r} with p < q < r, since there is no interval that contains the two extreme points p and r without also containing q. Namely, the subset {p, r} is not realizable for intervals, implying that the largest set shattered by the range space (real line, intervals) is of size two. We conclude that the VC dimension of this range space is two.
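The two-point and three-point arguments can also be checked mechanically. The following sketch (not from the text) brute-forces which subsets of a point set on the line are realizable by closed intervals; the function name is introduced here only for illustration.

```python
from itertools import combinations

def interval_shatters(points):
    """Return True if the range space (real line, intervals) shatters `points`,
    i.e., every subset of `points` equals points ∩ [a, b] for some interval."""
    pts = sorted(points)
    realized = set()
    endpoints = pts + [min(pts) - 1]            # the extra endpoint yields the empty set
    for a in endpoints:
        for b in endpoints:
            realized.add(frozenset(p for p in pts if a <= p <= b))
    return all(frozenset(s) in realized
               for r in range(len(pts) + 1)
               for s in combinations(pts, r))

print(interval_shatters([1, 2]))       # True:  two points are shattered
print(interval_shatters([1, 2, 3]))    # False: the subset {1, 3} is not realizable
```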
Disks. Let X = IR², and let R be the set of disks in the plane. Clearly, for three points in the plane, say 1, 2, and 3, one can find 8 disks that realize all possible 2³ different subsets; see the figure on the right.
But can disks shatter a set of four points? Consider such a set P of four points; there are two possible options. Either the convex hull of P has three points on its boundary, and in this case, the subset containing those three vertices but not the middle point
namely, the ranges of S̄ are the complements of the ranges in S. What is the VC dimension of S̄? Clearly, a set B ⊆ X is shattered by S̄ if and only if it is shattered by S. Thus, dimVC(S̄) = dimVC(S).
Lemma 5.1.2 For a range space S = (X, R) and its complement range space S̄, it holds that dimVC(S̄) = dimVC(S).
Theorem 5.1.3 (Radon’s Lemma) Let A = {p1 , . . . , pd+2 } be a set of d + 2 points in IRd . Then, there exists
two disjoint subsets C and D of A, such that CH(C) ∩ CH(D) , ∅.
P P
Proof: We claim that there exists β1 , . . . , βd+2 , non all of them zero, such that i βi pi = 0 and i βi = 0.
Proof: If pd+1 = pd+2 then we are done. Otherwise, without loss of generality, assume that p1 , . . . , pd spans IRd . Then, there are
P P P P
two non-zero combinations of p1 , . . . , pd , such that pd+1 = di=1 αi pi and pd+2 = di=1 γi pi . Let α = di=1 αi − 1 and γ = di=1 γi − 1.
Pd
If α = 0 then i=1 αi pi − pd+1 (which
Pis the origin) is the required
combination. Similarly, we are done if γ = 0. Otherwise, consider
P
the point di=1 (αi /α)pi − pd+1 /α − di=1 (γi /γ)pi − pd+2 /γ . Clearly, this is the required point.
Pk
Assume, for the sake of simplicity of exposition, that the β1 , . . . βk ≥ 0 and βk+1 , . . . , βd+2 < 0. Furthermore, let µ = i=1 βi .
We have that
X k X
d+2
βi pi = − βi pi .
i=0 i=k+1
64
Pk Pd+2
In particular, v = i=0 (βi /µ)pi is a point in the CH({p1 , . . . , pk }) and i=k+1 −(βi /µ)pi ∈ CH({pk+1 , . . . , pd+2 }). We conclude that v
is in the intersection of the two convex hulls, as required.
In particular, this implies that if a set Q of d + 2 points is shattered by S (the range space defined by halfspaces in IR^d), then we can partition Q
into two disjoint sets A and B such that CH(A) ∩ CH(B) ≠ ∅. It should now be clear that any halfspace
h+ containing all the points of A must also contain a point of CH(B). But this implies that a point of
B must be in h+. Namely, the subset A cannot be realized by a halfspace, which implies that Q cannot be
shattered. Thus dimVC(S) < d + 2. It is also easy to verify that the regular simplex with d + 1 vertices is
shattered by S. Thus, dimVC(S) = d + 1.
In the following, let Gd(n) = Σ_{i=0}^d C(n, i) denote the number of subsets of an n-element set of size at most d (here C(n, i) denotes the binomial coefficient). It is easy to verify that
Gd(n) ≤ n^d,    (5.1)
for d > 1 (the cases where d = 0 or d = 1 are not interesting and we will just ignore them with contempt).
Note that for all n, d ≥ 1, we have Gd(n) = Gd(n − 1) + Gd−1(n − 1).¯
Lemma 5.2.1 (Sauer’s Lemma) If (X, R) is a range space of VC-dimension d with |X| = n, then |R| ≤ Gd(n).

Proof: The proof is by induction on d and n. Fix an element x ∈ X, and consider the two range spaces (X \ {x}, R \ x) and (X \ {x}, R_x), where R \ x = { r \ {x} | r ∈ R } and R_x = { r \ {x} | r ∪ {x} ∈ R and r \ {x} ∈ R }.

Observe that |R| = |R_x| + |R \ x|. Indeed, we charge the elements of R to their corresponding element in R \ x.
The only bad case is when there is a range r such that both r ∪ {x} ∈ R and r \ {x} ∈ R, because then these
two distinct ranges get mapped to the same range in R \ x. But such ranges contribute exactly one element
to R_x.
Observe also that (X \ {x}, R_x) has VC-dimension at most d − 1, as the largest set that can be shattered is of size d − 1.
Indeed, if a set B ⊆ X \ {x} is shattered by R_x, then B ∪ {x} is shattered by R.
Thus,
|R| = |R_x| + |R \ x| ≤ Gd−1(n − 1) + Gd(n − 1) = Gd(n),
by induction.
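The following small Python check (an illustration added here, not part of the text) evaluates Gd(n) = Σ_{i≤d} C(n, i), verifies the recurrence Gd(n) = Gd(n − 1) + Gd−1(n − 1), and confirms Sauer's lemma on the interval range space from Section 5.1.1, which has VC-dimension 2 and realizes exactly 1 + n(n + 1)/2 subsets of an n-point set (the empty set plus every contiguous run).

    from math import comb

    def G(d, n):
        """G_d(n) = number of subsets of an n-element set of size at most d."""
        return sum(comb(n, i) for i in range(d + 1))

    # the recurrence G_d(n) = G_d(n-1) + G_{d-1}(n-1)
    for n in range(1, 20):
        for d in range(1, 6):
            assert G(d, n) == G(d, n - 1) + G(d - 1, n - 1)

    # Sauer's lemma for intervals: VC-dimension 2, and the number of realizable
    # subsets of n points (empty set plus every contiguous run) never exceeds G_2(n).
    for n in range(1, 20):
        num_interval_ranges = 1 + n * (n + 1) // 2
        assert num_interval_ranges <= G(2, n)
    print("recurrence and Sauer bound verified for small n")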
Definition 5.2.2 (Shatter function.) Given a range space S = (X, R), its shatter function πS(m) is the
maximum number of sets that might be created by S when restricted to subsets of size m. Formally,
πS(m) = max_{B ⊆ X, |B| = m} |R|B|.
The shattering dimension of S is the smallest d such that πS(m) = O(m^d), for all m.
¯
Here is a cute (and standard) counting argument: Gd (n) is just the number of different subsets of size at most d out of n elements.
Now, we either decide to not include the first element in these subsets (i.e., Gd (n − 1)) or, alternatively, we include the first element
in these subsets, but then there are only d − 1 elements left to pick (i.e., Gd−1 (n − 1)).
By applying Lemma 5.2.1 to a finite subset of X, we get:
Corollary 5.2.3 If S = (X, R) is a range space of VC-dimension d, then for every finite subset B of X, we
have |R|B| ≤ πS(|B|) ≤ Gd(|B|).
Namely, the VC-dimension of a range space always bounds the shattering dimension of this range space.
Proof: The first part follows from the definition of the shatter function, and the second part follows by applying Lemma 5.2.1 to the range space (B, R|B), whose VC-dimension is at most d.
For the last part, let n = |B|, and observe that |R|B| ≤ Gd(n) ≤ n^d, by Eq. (5.1). As such,
by definition, the shattering dimension of S is at most d; namely, the shattering dimension
is bounded by the VC-dimension.
Disks revisited. To see why the shattering dimension is more convenient to work with than the VC-dimension,
consider the range space S = (X, R), where X = IR² and R is the set of disks in the plane. We know that the
VC-dimension of S is 3 (see Section 5.1.1).
Now, consider the shattering dimension of this range space. Let P be a set of n points in the plane, and
observe that given a disk in the plane, we can continuously deform it until it passes through three points of P
on its boundary, without any point of P changing sides (no point previously outside the disk moves into its interior, and vice versa). As such, the number
of subsets of P that one can realize by intersecting it with a disk is bounded by C(n, 3) · 2³ (we pick the three
vertices that determine the disk, and for each of the three vertices we determine whether we consider it to
be inside the disk or not). As such, the shattering dimension of S is 3.
That might not seem like a great simplification over the same bound we got by arguing about the VC-
dimension. However, the above argumentation gives us a very powerful tool – the shattering dimension of a
range space defined by a family of shapes is always bounded by the number of points that determine a shape
in the family. Thus, the shattering dimension of, say, arbitrarily oriented rectangles in the plane is five, since
such a rectangle is uniquely determined by five points.
Given a range space S = (X, R), its dual range space is S⋆ = (R, X⋆), where X⋆ = { Rp | p ∈ X } and Rp = { r ∈ R | p ∈ r } is the set of ranges containing the point p. To understand what the dual space is, consider X to be the plane, and R to
be a set of m disks. Then, in the dual range space S⋆ = (R, X⋆), every point p in the plane has a set associated with it in X⋆, which is the set of disks of
R that contain p. In particular, if we consider the arrangement formed by the m disks of R, then all the points lying inside a single face of this arrangement
correspond to the same set of X⋆ (see Figure 5.1). Namely, the number of ranges in X⋆ is
bounded by the complexity of the arrangement of these disks, which is O(m²).
Figure 5.1: Rp = Rq.
Thus, let the dual shatter function of the range space S be π⋆S(m) = πS⋆(m), where S⋆ is the dual range space to S. The dual shattering dimension of S is
just the shattering dimension of S⋆.
Note that the dual shattering dimension might be smaller than the shattering dimension or the VC-
dimension of the range space. Indeed, in the case of disks in the plane, the dual shattering dimension is
just 2, while the VC-dimension and the shattering dimension of this range space are 3. Note also that in
geometric settings bounding the dual shattering dimension is relatively easy, as all you have to do is to
bound the complexity of the arrangement of m ranges of this space.
The following lemma shows a connection between the VC-dimension of a space and its dual. The proof
is left as an exercise for the interested reader (see Exercise 5.7.1).
Lemma 5.2.4 Consider a range space S = (X, R) with VC-dimension d. The dual range space S⋆ = (R, X⋆)
has VC-dimension bounded by 2^{d+1}.
Lemma 5.2.5 Let S = (X, R) and T = (X, R′) be two range spaces of VC-dimension d and d′, respectively, where d, d′ > 1. Let R̂ = { r ∪ r′ | r ∈ R, r′ ∈ R′ }. Then, for the range space Ŝ = (X, R̂), we have that dimVC(Ŝ) = O((d + d′) log(d + d′)).

Proof: Let B be a set of n points in X that is shattered by Ŝ. There are at most Gd(n) and Gd′(n) different
projections of B by ranges of R and R′, respectively. Every subset C of B realized by r̂ ∈ R̂
is a union of two subsets B ∩ r and B ∩ r′, where r ∈ R and r′ ∈ R′. Thus, the number of different subsets of
B realized by Ŝ is bounded by Gd(n) Gd′(n). Since B is shattered, 2^n ≤ Gd(n) Gd′(n) ≤ n^d n^{d′}, for d, d′ > 1. We conclude n ≤ (d + d′) lg n,
which implies that n ≤ c(d + d′) log(d + d′), for some constant c.
Corollary 5.2.6 Let S = (X, R) and T = (X, R′) be two range spaces of VC-dimension d and d′, respectively,
where d, d′ > 1. Let R̂ = { r ∩ r′ | r ∈ R, r′ ∈ R′ }. Then, for the range space Ŝ = (X, R̂), we have that
dimVC(Ŝ) = O((d + d′) log(d + d′)).
Proof: Observe that the complement of r ∩ r′ is r̄ ∪ r̄′, and thus the claim follows by Lemma 5.1.2 and Lemma 5.2.5.
In fact, we can summarize the above observations, as follows.
Corollary 5.2.7 Any finite sequence of combining range spaces with finite VC-dimension results in a range
space with a finite VC-dimension.
Definition 5.3.1 (ε-sample) Let S = (X, R) be a range space, and let B be a finite subset of X. A subset C ⊆ B is an ε-sample for B if for any range r ∈ R, we have
| |C ∩ r| / |C| − |B ∩ r| / |B| | ≤ ε.
Namely, an ε-sample is a subset of the ground set that “captures” the range space up to an error of ε;
to estimate (approximately) the fraction of the ground set covered by a range r, it is sufficient to
count the points of C that fall inside r.
If B = X and B is a finite set, we will abuse notation slightly and refer to C as an ε-sample for S.
The author is quite aware that the interest of the reader in this issue might not be the result of free choice. Nevertheless, one
might draw some comfort from the realization that the existence of the interested reader is as much of an illusion as the existence
of free choice. Both are convenient to assume, and both are probably false. Or maybe not.
To see the usage of such a sample, consider X to be, say, the population of a country (i.e., an
element of X is a citizen). A range in R is the set of all people in the country that answer yes to a question
(i.e., would you vote for party Y? would you buy a bridge from me? stuff like that). An ε-sample of this
range space enables us to estimate reliably (up to an error of ε) the answer for all these questions, by just
asking the people in the sample.
The natural question of course is to find such a subset of small (or minimal) size.
Theorem 5.3.2 (ε-sample theorem, [VC71]) There is a positive constant c such that if (X, R) is any range
space with VC-dimension at most d, B ⊆ X is a finite subset and ε, δ > 0, then a random subset C of B of
cardinality s, where s is at least the minimum of |B| and
(c/ε²) ( d log(d/ε) + log(1/δ) ),
is an ε-sample for B with probability at least 1 − δ.
Sometimes it is sufficient to have (hopefully smaller) samples with a weaker property – if a range is
“heavy” then there is an element in our sample that is in this range.
Definition 5.3.3 (ε-net) A set N ⊆ B is an ε-net for B if for any range r ∈ R with |r ∩ B| ≥ ε|B|, the range r
contains at least one point of N (i.e., r ∩ N ≠ ∅).
Theorem 5.3.4 (ε-net theorem, [HW87]) Let (X, R) be a range space of VC-dimension d, let B be a finite
subset of X and suppose 0 < ε, δ < 1. Let N be a set obtained by m random independent draws from B,
where
m ≥ max( (4/ε) log(2/δ), (8d/ε) log(8d/ε) ).    (5.2)
Then N is an ε-net for B with probability at least 1 − δ.
The above two theorems also hold for spaces with shattering dimension at most d. The constant in the
sample size deteriorates a bit.
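As a rough illustration (added here, not from the text), the snippet below just evaluates the two sample-size bounds numerically: the ε-sample bound of Theorem 5.3.2 (whose constant c is unspecified, so c = 1 is used only as a placeholder) and the ε-net bound m of Eq. (5.2), reading log as lg, as in the proof. For small ε the ε-net bound is roughly a factor of 1/ε smaller.

    from math import log, log2

    def eps_sample_size(d, eps, delta, c=1.0):
        """The bound of Theorem 5.3.2; c is unspecified there, so c=1 is a placeholder."""
        return (c / eps**2) * (d * log(d / eps) + log(1 / delta))

    def eps_net_size(d, eps, delta):
        """The bound m of Eq. (5.2), reading log as lg (base 2)."""
        return max((4 / eps) * log2(2 / delta), (8 * d / eps) * log2(8 * d / eps))

    d, delta = 3, 0.01
    for eps in (0.1, 0.01):
        print(eps, round(eps_sample_size(d, eps, delta)), round(eps_net_size(d, eps, delta)))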
As an application, consider the following learning scenario: there is an unknown distribution D over the plane, and points are labeled by an unknown function f that assigns ‘1’ to the points inside some (unknown) disk and ‘0’ to the points outside it. Theorem 5.3.2 tells us that if we pick (roughly) O((1/ε) log(1/ε)) random points in a sample R from
this distribution, compute the labels for the samples, and find the smallest disk D that contains the sampled
points labeled by ‘1’ and does not contain any of the ‘0’ points, then the function g that returns ‘1’ inside the disk
and ‘0’ otherwise correctly classifies all but an ε-fraction of the points (i.e., the probability of misclassifying a
point picked according to the given distribution is smaller than ε).
To see that, consider the range space S having the plane as the ground set, and the symmetric differences
between pairs of disks as the ranges. By Corollary 5.2.7, this range space has finite VC-dimension. Now,
consider the (unknown) disk D′ that induces f and the region r = D ⊕ D′. Clearly, the learned classifier g
returns an incorrect answer only for points picked inside r.
So the probability for a mistake in the classification is the measure of r under the distribution D. If
Pr_D[r] > ε, then by the ε-net theorem (i.e., Theorem 5.3.4) the set R is an ε-net for S (ignore for the time
being the possibility that the random sample fails to be an ε-net), and as such R contains a point q inside r.
But then it is not possible that g (which classifies correctly all the sampled points of R) makes a mistake on
q. A contradiction, because by construction the range r is exactly where g misclassifies points. We conclude that
Pr_D[r] ≤ ε, as desired.
Tell me lies, tell me sweet little lies. The careful reader might be tearing her hair out because of the above
description. First, Theorem 5.3.4 might fail, and the above conclusion might not hold. This is of course
true, and in real applications one might use a much larger sample to guarantee that the probability of failure
is so small that it can be practically ignored. A more serious issue is that Theorem 5.3.4 is defined only
for finite sets. Nowhere does it speak about a continuous distribution. Intuitively, one can approximate a
continuous distribution to an arbitrary precision using a huge sample, and apply the theorem to this sample
as our ground set. A formal proof is more tedious and requires extending the proof of Theorem 5.3.4 to
continuous distributions. This is straightforward and we will ignore this topic altogether.
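To see the learning argument in action without implementing smallest enclosing disks, here is a simplified one-dimensional analogue (my own sketch, under the assumption that the target concept is an interval on [0, 1] and the distribution is uniform; the sample-size constant is ad hoc): the learned classifier g is the smallest interval containing the positively labeled samples, and its error is estimated empirically.

    import math, random

    def learn_interval(eps, n_test=100_000, seed=0):
        """1D analogue of the disk-learning example: learn an unknown interval."""
        rng = random.Random(seed)
        lo, hi = 0.3, 0.7                          # the unknown target interval

        def f(x):                                  # the unknown labeling function
            return lo <= x <= hi

        # sample size of the flavor suggested by the eps-net theorem (constants ad hoc)
        m = int((8 / eps) * max(1.0, math.log2(8 / eps)))
        sample = [rng.random() for _ in range(m)]  # distribution = uniform on [0, 1]
        pos = [x for x in sample if f(x)]

        g_lo, g_hi = (min(pos), max(pos)) if pos else (1.0, 0.0)

        def g(x):                                  # the learned classifier
            return g_lo <= x <= g_hi

        errors = 0
        for _ in range(n_test):
            x = rng.random()
            errors += (f(x) != g(x))
        return errors / n_test

    print(learn_interval(eps=0.05))                # typically well below 0.05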
A Naive Proof of the ε-Sample Theorem. To demonstrate why the ε-sample/net theorems are interesting, let us try to prove
the ε-sample theorem in the natural naive way. Thus, consider a finite range space S = (X, R) with shattering dimension d, and
consider a range r that contains, say, a p fraction of the points of X, where p ≥ ε. Consider a random sample R of r points from X,
picked with replacement.
Let pi be the ith sample point, and let Xi be an indicator variable which is one if and only if pi ∈ r. Clearly, (Σ_i Xi)/r is an
estimate for p = |r ∩ X| / |X|. We would like this estimate to be within ±ε of p, and with confidence ≥ 1 − δ.
As such, the sample failed if |Σ_{i=1}^r Xi − pr| ≥ ∆ = εr = (ε/p)pr. Set φ = ε/p and µ = E[Σ_i Xi] = pr. Using the Chernoff
inequality (Theorem 25.2.6 and Theorem 25.2.9) we have
Pr[ |Σ_{i=1}^r Xi − pr| ≥ (ε/p)pr ] = Pr[ |Σ_{i=1}^r Xi − µ| ≥ φµ ] ≤ exp(−µφ²/2) + exp(−µφ²/4) ≤ 2 exp(−µφ²/4) = 2 exp( −ε²r/(4p) ) ≤ δ,
for r ≥ ⌈ (4/ε²) ln(2/δ) ⌉ ≥ ⌈ (4p/ε²) ln(2/δ) ⌉.
Voila! We have proved the ε-sample theorem. Well, not quite. We proved that the sample works correctly for a single range.
The problem is that the number of ranges for which we need to prove the theorem is πS(|X|) (see Definition 5.2.2). In particular, if we
plug the confidence δ/πS(|X|) into the above analysis and use the union bound, we get that for
r ≥ ⌈ (4/ε²) ln( πS(|X|)/δ ) ⌉
the sample estimates correctly (up to ±ε) the size of all ranges, with confidence ≥ 1 − δ. Bounding
πS(|X|) by O(|X|^d) (using Eq. (5.1) for a space with VC-dimension d), we can bound the required size of r by O( d ε^{−2} log(|X|/δ) ).
Namely, the “naive” argumentation gives us a sample bound which depends on the underlying size of the ground set. However,
the sample size in the ε-sample theorem (Theorem 5.3.2) is independent of the size of the ground set. This is the magical property
of the ε-sample theorem.®
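The contrast between the two bounds is easy to see numerically. The snippet below (an added illustration, with the unspecified constants set to 4 and 1 respectively) evaluates the union-bound sample size, which grows with n = |X|, against the ε-sample bound of Theorem 5.3.2, which does not.

    from math import log

    def naive_union_bound(d, eps, delta, n):
        """Sample size from the union-bound argument above: grows with n = |X|."""
        return (4 / eps**2) * log(n**d / delta)

    def eps_sample_bound(d, eps, delta, c=1.0):
        """Theorem 5.3.2 bound (unspecified constant set to 1): independent of n."""
        return (c / eps**2) * (d * log(d / eps) + log(1 / delta))

    d, eps, delta = 2, 0.1, 0.01
    for n in (10**3, 10**6, 10**9):
        print(n, round(naive_union_bound(d, eps, delta, n)), round(eps_sample_bound(d, eps, delta)))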
5.3.3 A quicky proof of Theorem 5.3.4
Here we provide a sketchy proof of Theorem 5.3.4, which conveys the main ideas. The full proof in all its glory and details is
provided in Section 5.5.
Let N = (x1, . . . , xm) be the sample obtained by m independent draws from A, where A ⊆ X is the given ground set with |A| = n. Let E1 be the event that N fails to be an
ε-net. Namely,
E1 ≡ [ ∃r ∈ R such that |r ∩ A| ≥ εn and r ∩ N = ∅ ].
To complete the proof, we must show that Pr[E1] ≤ δ. Let T = (y1, . . . , ym) be another random sample generated in a similar
fashion to N. Let E2 be the event that N fails but T “works”; formally,
E2 ≡ [ ∃r ∈ R such that |r ∩ A| ≥ εn, r ∩ N = ∅, and |r ∩ T| ≥ εm/2 ].
Intuitively, since E_T[|r ∩ T|] ≥ εm, for the range r that N fails on we have with “good” probability that |r ∩ T| ≥ εm/2.
Namely, Pr[E1] ≈ Pr[E2].
Next, let
E2′ ≡ [ ∃r ∈ R such that r ∩ N = ∅ and |r ∩ T| ≥ εm/2 ].
Clearly, E2 ⊆ E2′, and as such Pr[E2] ≤ Pr[E2′]. Now, fix Z = N ∪ T, and observe that |Z| = 2m. Next, fix a range r, and observe
that the bad probability of E2′ is maximized if |r ∩ Z| = εm/2. Now, the probability that all the elements of r ∩ Z fall only into the
second half of the sample (i.e., into T) is at most 2^{−εm/2}, as a careful calculation shows. Now, there are at most |R|Z| ≤ Gd(2m) different ranges that
one has to consider. As such, Pr[E1] ≈ Pr[E2] ≤ Pr[E2′] ≤ Gd(2m) 2^{−εm/2}, and this is smaller than δ, as a careful calculation shows,
by just plugging the value of m into the right-hand side.
5.4 Discrepancy
The proof of the ε-sample/net Theorem is somewhat complicated. It turns out that one can get a somewhat similar result by
attacking the problem from the other direction; namely, let us assume that we would like to take a truly large sample of a finite
range space S = (X, R) defined over n elements with m ranges. We would like this sample to be as representative as possible as far
as S is concerned. In fact, let us decide that we would like to pick exactly half of the points of X to our sample (assume that n = |X|
is even).
To this end, let us color half of the points of X by −1 (i.e., black) and the other half by 1 (i.e., white). If for every range, r ∈ R,
the number of black points inside it is equal to the number of white points, then doubling the number of black points inside a range,
gives us the exact number of points inside the range. Of course, such a perfect coloring is unachievable in almost all situations. To
see this, consider the clique graph K4 – clearly, in any coloring of the vertices, there must be an edge with two endpoints having the
same color.
Formally, let χ : X → {−1, 1} be the given coloring. The discrepancy of χ over a range r is the amount of imbalance in the
coloring inside r. Namely,
|χ(r)| = | Σ_{p ∈ r} χ(p) |.
The overall discrepancy of χ is disc(χ) = max_{r ∈ R} |χ(r)|. The discrepancy of a (finite) range space S = (X, R) is the discrepancy of
the best possible coloring; namely,
disc(S) = min_{χ : X → {−1,+1}} disc(χ).
The natural question is, of course, how to compute the coloring χ of minimum discrepancy. This seems like a very challenging
question, but when you do not know what to do, you might as well do something random. So, let us pick a random coloring χ of
X. To this end, let P be an arbitrary partition of X into pairs (i.e., a perfect matching). For a pair {p, q} ∈ P, we will either color
χ(p) = −1 and χ(q) = 1, or the other way around; namely, χ(p) = 1 and χ(q) = −1. We will decide how to color this pair using
a single coin flip. Thus, our coloring would be induced by making such a decision for every pair of P, and let χ be the resulting
coloring. We will refer to χ as compatible with the partition P if χ({p, q}) = 0, for all {p, q} ∈ P.
Consider a range r. If a pair {p, q} ∈ P falls completely inside r or completely outside r, then it does not
contribute anything to the discrepancy of r. Thus, the only pairs that contribute to the discrepancy of r are the crossing
pairs; namely, pairs with {p, q} ∩ r ≠ ∅ and {p, q} ∩ (X \ r) ≠ ∅. In particular, let #r denote the number of
crossing pairs for r, and let Xi ∈ {−1, +1} be the indicator variable which is the contribution of the ith crossing
pair to the discrepancy of r. For ∆r = sqrt( 2 #r ln(4m) ), we have by the Chernoff inequality (Theorem 25.2.1) that
Pr[ |χ(r)| ≥ ∆r ] ≤ 2 Pr[ Σ_i Xi ≥ ∆r ] ≤ 2 exp( −∆r² / (2 #r) ) = 1/(2m).
Since there are m ranges in R, it follows that with good probability (i.e., at least half) for all r ∈ R the discrepancy
of r is at most ∆r.
Theorem 5.4.1 Let S = (X, R) be a range space defined over n = |X| elements with m = |R| ranges. Consider any partition P of the
elements of X into pairs. Then, with probability ≥ 1/2, a random coloring χ : X → {−1, +1} that is compatible
with the partition P has, for every range r ∈ R, discrepancy at most
|χ(r)| ≤ ∆r = sqrt( 2 #r ln(4m) ),
where #r denotes the number of pairs of P that cross r. In particular, since #r ≤ |r|, we have |χ(r)| ≤ sqrt( 2 |r| ln(4m) ).
Observe that for every range r it holds that #r ≤ n/2, since 2 #r ≤ |X|. As such, we have:
Corollary 5.4.2 Let S = (X, R) be a range space defined over n elements with m ranges. Let P be an arbitrary partition of X into
pairs. Then a random coloring which is compatible with P has disc(χ) ≤ sqrt( n ln(4m) ), with probability ≥ 1/2.
One can easily amplify the probability of success of the coloring by increasing the threshold. In particular, for any constant
c ≥ 1, one has that
|χ(r)| ≤ sqrt( 2c #r ln(4m) )   for all r ∈ R,
with probability ≥ 1 − 2/(4m)^c.
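The following Python experiment (added for illustration, not from the text) applies the random compatible coloring to the concrete range space of intervals over n points on a line. Pairing consecutive points makes every interval cross at most two pairs, so by Theorem 5.4.1 the discrepancy is tiny; a random pairing, for which #r can be as large as n/2, typically gets much closer to the generic bound of Corollary 5.4.2.

    import math, random

    def compatible_coloring(pairs, n, rng):
        """Color each pair {p, q} as (+1, -1) or (-1, +1) by an independent coin flip."""
        chi = [0] * n
        for p, q in pairs:
            s = rng.choice((1, -1))
            chi[p], chi[q] = s, -s
        return chi

    def interval_discrepancy(chi):
        """max over all interval ranges [i, j] of |sum of chi over the range|."""
        best = 0
        for i in range(len(chi)):
            run = 0
            for j in range(i, len(chi)):
                run += chi[j]
                best = max(best, abs(run))
        return best

    rng = random.Random(1)
    n = 1000                                   # ground set: points 0, 1, ..., n-1 on a line
    m = n * (n + 1) // 2                       # number of distinct interval ranges

    consecutive = [(i, i + 1) for i in range(0, n, 2)]           # every interval crosses <= 2 pairs
    perm = list(range(n)); rng.shuffle(perm)
    shuffled = [(perm[i], perm[i + 1]) for i in range(0, n, 2)]  # crossing number can be ~ n/2

    for pairs in (consecutive, shuffled):
        chi = compatible_coloring(pairs, n, rng)
        print(interval_discrepancy(chi), "vs bound", round(math.sqrt(n * math.log(4 * m))))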
Lemma 5.4.3 Let Q ⊆ P be a δ-sample for P (in some underlying range space S), and let R ⊆ Q be a ρ-sample for Q. Then R is
a (δ + ρ)-sample for P.
By Lemma 5.4.3, we have that Pk is a (Σ_{i=1}^k δi)-sample for P. Since we would like the smallest set in the sequence P1, P2, . . . that
is still an ε-sample, we would like to find the maximal k such that Σ_{i=1}^k δi ≤ ε. We thus require that
Σ_{i=1}^k δi = Σ_{i=1}^k τ(n_{i−1}) = c Σ_{i=1}^k sqrt( d ln(n/2^{i−1}) / (n/2^{i−1}) ) ≤ c1 sqrt( d ln(n/2^{k−1}) / (n/2^{k−1}) ) = c1 sqrt( d ln n_{k−1} / n_{k−1} ) ≤ ε,
where c1 is a sufficiently large constant. This holds for n_{k−1} ≥ (4 c1² d / ε²) ln(c1 d / ε), as can be verified by an easy calculation. In
particular, taking the largest k for which this holds results in a set Pk of size O( (d/ε²) ln(d/ε) ), which is an ε-sample for P.
Theorem 5.4.4 (ε-sample via discrepancy.) There is a positive constant c such that if (X, R) is any range space with shattering
dimension at most d, B ⊆ X is a finite subset and ε, δ > 0, then there exists a subset C ⊆ B, of cardinality O( (d/ε²) ln(d/ε) ), such
that C is an ε-sample for B.
for some constant c, since the crossing number of a range r ∩ P_{i−1} is always bounded by its size. This is equivalent to
2^{i−1} ν_{i−1} − 2^i νi ≤ c 2^{i−1} sqrt( d ν_{i−1} ln n_{i−1} ).    (5.4)
We need the following technical claim, which states that νk behaves as we expect: as long as the set Pk is large enough,
νk is roughly ν0/2^k.
Claim 5.4.5 There is a constant c4 (independent of d), such that for all k with ν0/2^k ≥ c4 d ln nk, it holds that (ν0/2^k)/2 ≤ νk ≤ 2(ν0/2^k).
Proof: The proof is by induction. For k = 0 the claim trivially holds. Assume that it holds for all i < k. Adding up the
inequalities of Eq. (5.4), for i = 1, . . . , k, we have that
ν0 − 2^k νk ≤ Σ_{i=1}^k c 2^{i−1} sqrt( d ν_{i−1} ln n_{i−1} ) ≤ Σ_{i=1}^k c 2^{i−1} sqrt( 2d (ν0/2^{i−1}) ln n_{i−1} ) ≤ c3 2^k sqrt( d (ν0/2^k) ln nk ),
Theorem 5.4.6 (ε-net via discrepancy.) There is a positive constant c such that if (X, R) is any range space with shattering
dimension at most d, B ⊆ X is a finite subset and ε, δ > 0, then there exists a subset C ⊆ B, of cardinality O((d/ε) ln(d/ε)), such that
C is an ε-net for B.
5.5 Proof of the ε-net Theorem
In this section, we finally prove Theorem 5.3.4.
Let (X, R) be a range space of VC-dimension d, and let A be a subset of X of cardinality n. Suppose that m satisfies Eq. (5.2).
Let N = (x1, . . . , xm) be the sample obtained by m independent samples from A (the elements of N are not necessarily distinct, and
this is why we treat N as an ordered set). Let E1 be the event that N fails to be an ε-net. Namely,
E1 ≡ [ ∃r ∈ R such that |r ∩ A| ≥ εn and r ∩ N = ∅ ].
(Namely, there exists a “heavy” range r that does not contain any point of N.) To complete the proof, we must show that Pr[E1] ≤ δ.
Let T = (y1, . . . , ym) be another random sample generated in a similar fashion to N. Let E2 be the event that N fails but T “works”;
formally,
E2 ≡ [ ∃r ∈ R such that |r ∩ A| ≥ εn, r ∩ N = ∅, and |r ∩ T| ≥ εm/2 ].
Intuitively, since E_T[|r ∩ T|] ≥ εm, for the range r that N fails on we have with “good” probability that |r ∩ T| ≥ εm/2.
Namely, E1 and E2 have more or less the same probability.

Lemma 5.5.1 Pr[E2] ≤ Pr[E1] ≤ 2 Pr[E2].
Proof: Clearly, E2 ⊆ E1, and thus Pr[E2] ≤ Pr[E1]. As for the other part, note that by the definition of conditional probability,
we have
Pr[E2 | E1] = Pr[E2 ∩ E1] / Pr[E1] = Pr[E2] / Pr[E1].
It is thus enough to show that Pr[E2 | E1] ≥ 1/2.
Assume that E1 occurs. Then there is r ∈ R such that |r ∩ A| > εn and r ∩ N = ∅. The required probability is at least the
probability that for this specific r we have |r ∩ T| ≥ εm/2. However, |r ∩ T| is a binomial variable with expectation εm and variance
ε(1 − ε)m ≤ εm. Thus, by the Chebychev inequality (Theorem 25.1.2), it holds that
Pr[ |r ∩ T| < εm/2 ] ≤ Pr[ | |r ∩ T| − εm | > εm/2 ] = Pr[ | |r ∩ T| − εm | > (sqrt(εm)/2) · sqrt(εm) ] ≤ ( 2/sqrt(εm) )² ≤ 1/2,
since εm ≥ 8 by our choice of m (Eq. (5.2)). Thus
Pr[E2] / Pr[E1] ≥ Pr[ |r ∩ T| ≥ εm/2 ] = 1 − Pr[ |r ∩ T| < εm/2 ] ≥ 1/2.
Next, define
E2′ ≡ [ ∃r ∈ R such that r ∩ N = ∅ and |r ∩ T| ≥ εm/2 ].
Clearly, E2 ⊆ E2′, and thus Pr[E2] ≤ Pr[E2′].

Lemma 5.5.2 Pr[E2] ≤ Pr[E2′] ≤ Gd(2m) 2^{−εm/2}.

Proof: We imagine that we sample the elements of N ∪ T together, by picking Z = (z1, . . . , z2m) independently from A. Next,
we randomly decide which m elements of Z go into N, and the remaining elements go into T. Clearly,
Pr[E2′] = Σ_Z Pr[E2′ | Z] Pr[Z].
Thus, from this point on, we fix the set Z, and we bound Pr[E2′ | Z]. (Note that Pr[E2′] can be interpreted as an averaging of Pr[E2′ | Z], and thus
a bound on this quantity would imply the same bound on Pr[E2′].)
It is now enough to consider the ranges in the projection space (Z, R|Z). By Lemma 5.2.1, we have |R|Z| ≤ Gd(2m).
Let us fix any r ∈ R|Z, and consider the event
Er ≡ [ r ∩ N = ∅ and |r ∩ T| > εm/2 ].
For k = |r ∩ (N ∪ T)| ≥ εm/2, we have
Pr[Er] ≤ Pr[ r ∩ N = ∅ | |r ∩ (N ∪ T)| = k ] = C(2m − k, m) / C(2m, m)
= [ (2m − k)(2m − k − 1) · · · (m − k + 1) ] / [ 2m(2m − 1) · · · (m + 1) ]
= [ m(m − 1) · · · (m − k + 1) ] / [ 2m(2m − 1) · · · (2m − k + 1) ] ≤ 2^{−k} ≤ 2^{−εm/2}.
Thus,
Pr[E2′ | Z] ≤ Σ_{r ∈ R|Z} Pr[Er] ≤ |R|Z| 2^{−εm/2} ≤ Gd(2m) 2^{−εm/2},
implying that Pr[E2′] ≤ Gd(2m) 2^{−εm/2}.
Proof of Theorem 5.3.4. By Lemma 5.5.1 and Lemma 5.5.2, we have Pr[E1] ≤ 2 Gd(2m) 2^{−εm/2}. It thus remains to verify that
if m satisfies Eq. (5.2), then 2 Gd(2m) 2^{−εm/2} ≤ δ.
Indeed, we know that 2m ≥ 8d, and as such Gd(2m) = Σ_{i=0}^d C(2m, i) ≤ Σ_{i=0}^d (2m)^i / i! ≤ (2m)^d, for d > 1. Thus, it is sufficient to show
that the inequality 2(2m)^d 2^{−εm/2} ≤ δ holds. By taking lg of both sides and rearranging, we have that this is equivalent to
εm/2 ≥ d lg(2m) + lg(2/δ).
By our choice of m (see Eq. (5.2)), we have that εm/4 ≥ lg(2/δ). Thus, we need to show that
εm/4 ≥ d lg(2m).
We verify this inequality for m = (8d/ε) lg(8d/ε). Indeed,
2d lg(8d/ε) ≥ d lg( (16d/ε) lg(8d/ε) ).
This is equivalent to (8d/ε)² ≥ (16d/ε) lg(8d/ε), which is equivalent to 4d/ε ≥ lg(8d/ε), which is certainly true for 0 < ε ≤ 1 and d > 1.
This completes the proof of the theorem.
5.6.1 Variants and extensions
A natural application of the ε-sample theorem is to use it to estimate the weights of ranges. In particular, given a finite range space
(X, R), we would like to build a data-structure such that we can decide quickly, given a query range r, what is the number of points
of X inside r. We could always use a sample of size (roughly) O(ε^{−2}) to get an estimate of the weight of a range, using the ε-sample
theorem. The error of the estimate is εn, where n = |X|; namely, the error is additive. The natural question is whether one can get
a relative estimate ρ, such that pn ≤ ρ ≤ (1 + ε)pn, where |r ∩ X| = pn.
In particular, a subset A ⊆ X is a (relative) (ε, p)-sample if for each r ∈ R of weight exceeding pn, it holds that
| |r ∩ A| / |A| − |r ∩ X| / |X| | ≤ ε |r ∩ X| / |X|.
Of course, one can simply generate an εp-sample of size (roughly) O(1/(εp)²) by the ε-sample theorem. This is not very interesting
when p = 1/√n. Interestingly, the dependency on p can be improved.
Theorem 5.6.1 ([LLS01]) Let (X, R) be a range space with shattering dimension d, where |X| = n, and let 0 < ε < 1 and 0 < p < 1
be given parameters. Consider a random sample A ⊆ X of size
( c / (ε² p) ) ( d log(1/p) + log(1/δ) ),
where c is a constant. Then, it holds that for each range r ∈ R of at least pn points, we have
| |r ∩ A| / |A| − |r ∩ X| / |X| | ≤ ε |r ∩ X| / |X|.
In other words, A is an (ε, p)-sample for (X, R). The probability of success is ≥ 1 − δ.
5.7 Exercises
Exercise 5.7.1 (On the VC-dimension of the dual range space.) [5 Points]
Prove Lemma 5.2.4. Namely, given a range space S = (X, R) of VC-dimension d, prove that the dual range space S⋆ = (R, X⋆)
has VC-dimension bounded by 2^{d+1}.
[Hint: Represent a finite range space as a matrix, where the elements are the columns and the sets are the rows. Interpret what it means for a set of size d to be shattered in this setting. What is the dual range space in this setting?]
(b) [5 Points] Let Ψ′ be the set of all permutations of 1, . . . , 2m. Prove that for a random σ ∈ Ψ′, we have
Pr[ | (Σ_{i=1}^m b_{σ(i)})/m − (Σ_{i=1}^m b_{σ(i+m)})/m | ≥ ε ] ≤ 2 e^{−C ε² m / 2},
where C is an appropriate constant. [Hint: Use (a), but be careful.]
(c) [10 Points] Prove Theorem 5.3.2 using (b).
Chapter 6

Sampling and the Moments Technique
6.1 Vertical Decomposition

Given a set S of n segments in the plane, and a subset R ⊆ S, let A|(R) denote the vertical decomposition of the plane formed by the arrangement A(R) of
the segments of R. This is the partition of the plane into interior-disjoint vertical
trapezoids formed by erecting vertical walls through each vertex of A|(R). Formally, a vertex of A|(R) is either an endpoint of a segment of R or an intersection
point of two of its segments. From each such vertex we shoot up (and, similarly, down)
a vertical ray till it hits a segment of R, or the ray continues all the way to infinity. See the figure on the right.
Note that a vertical trapezoid is defined by at most 4 segments: two segments defining its ceiling and floor, and two segments defining the two intersection points that induce the two vertical walls on its boundary. Of course, a
vertical trapezoid might be degenerate, and thus defined by fewer segments (e.g., an
unbounded vertical trapezoid, or a triangle).
Vertical decomposition breaks the faces of the arrangement, which might be
arbitrarily complicated, into entities (i.e., vertical trapezoids) of constant complexity. This makes handling arrangements much easier computationally.
In the following, we assume that the segments of S have k intersection points overall, and we want to compute the arrangement
A = A(S); namely, compute the edges, vertices and faces of A(S). One possible way of doing it, is the following: Compute a
random permutation of the segments of S: S = hs1 , . . . , sn i. Let Si = hs1 , . . . , si i be the prefix of length i of S. Compute A| (Si )
from A| (Si−1 ), for i = 1, . . . , n. Clearly, A| (S) = A| (Sn ), and we can extract A(S) from it.
Randomized Incremental Construction (RIC). Imagine that we had computed the arrangement Bi−1 = A| (Si−1 ). In
the ith iteration we compute Bi by inserting si into the arrangement Bi−1 . This involves splitting some trapezoids (and merging
some others),
As a concrete example, consider the figure on the right. Here we insert s into
the arrangement. To this end we split the “vertical trapezoids” △pqs and △rqs, each
into three trapezoids. The two trapezoids σ′ and σ′′ need to be now merged together
to form the new trapezoid which appears in the vertical decomposition of the new
arrangement. (Note that the figure does not show all the trapezoids in the vertical
decomposition.)
To facilitate this, we need to compute the trapezoids of Bi−1 that intersect si.
This is done by maintaining a conflict graph. Each trapezoid σ ∈ A|(Si−1) maintains
a conflict list cl(σ) of the segments of S that intersect its interior. We also maintain
a similar structure for each segment, listing all the trapezoids of A|(Si−1) that it currently intersects (in its interior). We maintain
those lists with cross-pointers, so that given an entry (σ, s) in the conflict list of σ, we can find the entry (s, σ) in the conflict list of
s in constant time.
Thus, given si, we know which trapezoids need to be split (i.e., all the trapezoids in cl(si)).
Splitting a trapezoid σ by a segment si is the operation of computing a set of (at most) 4 trapezoids that
cover σ and have si on their boundary. We compute those new trapezoids, and next we need to compute the
conflict lists of the new trapezoids. This can be easily done by taking the conflict list of a trapezoid σ ∈ cl(si)
and distributing its segments among the O(1) new trapezoids that cover σ. Using a careful implementation,
this requires time linear in the size of the conflict list of σ.
In the above description, we ignored the need to merge adjacent trapezoids if they have identical floor
and ceiling - this can be done by a somewhat straightforward and tedious implementation of the vertical-decomposition data-structure, by providing pointers between adjacent vertical trapezoids, and maintaining the conflict lists sorted
(or by using hashing) so that merge operations can be done quickly. This is somewhat tedious, but it can be done in time linear in
the input/output size involved, as can be verified.
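The following minimal sketch (my own, not the segment algorithm itself) runs the same randomized incremental scheme in one dimension, where the “trapezoids” are simply the open intervals between consecutive inserted points: it locates the interval containing the new point, splits it, and redistributes its conflict list, mirroring the conflict-list bookkeeping described above (with a linear scan standing in for the cross-pointers).

    import random

    def ric_1d(points, seed=0):
        """1D analogue of the RIC above: regions are open intervals between inserted
        points, each keeping a conflict list of the uninserted points in its interior.
        Returns the total size of all conflict lists ever created (the total work)."""
        rng = random.Random(seed)
        pts = points[:]
        rng.shuffle(pts)                       # random insertion order

        # Initially one interval (-inf, +inf) conflicting with every input point.
        intervals = {(float("-inf"), float("inf")): list(pts)}
        work = len(pts)

        for p in pts:
            # Locate the interval containing p.  (A real implementation keeps a
            # cross-pointer from each uninserted point to its interval; we just scan.)
            containing = next(iv for iv in intervals if iv[0] < p < iv[1])
            lo, hi = containing
            conflict = intervals.pop(containing)
            # Split the interval at p and distribute its conflict list.
            left = [q for q in conflict if q < p]
            right = [q for q in conflict if q > p]
            intervals[(lo, p)] = left
            intervals[(p, hi)] = right
            work += len(left) + len(right)
        return work

    rng = random.Random(42)
    n = 2000
    print(ric_1d([rng.random() for _ in range(n)]))    # grows roughly like n log n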
Claim 6.1.1 The (amortized) running time of constructing Bi from Bi−1 is proportional to the size of the conflict lists of the vertical
trapezoids in Bi \ Bi−1 (and the number of such new trapezoids).
Proof: We charge all the work involved in the ith iteration either to the conflict lists of the newly created trapezoids or to the deleted
conflict lists. Clearly, the running time of the algorithm in the ith iteration is linear in the total size of these conflict lists. Observe
that every conflict gets charged twice – when it is being created and when it is being deleted. As such, the (amortized) running time
in the ith iteration is proportional to the total length of the newly created conflict lists.
Thus, to bound the running time of the algorithm, it is enough to bound the expected size of the destroyed conflict lists in the ith
iteration (and sum this bound over the n iterations carried out by the algorithm), or, alternatively, to bound the expected size of the
conflict lists created in the ith iteration.
Lemma 6.1.2 Let S be a set of n segments (in general position°) with k intersection points. Let Si be the first i segments in a
random permutation of S. The expected size of Bi = A|(Si), denoted by τ(i) (i.e., the number of trapezoids in Bi), is O( i + k(i/n)² ).
Proof: Consider an intersection point p = s ∩ s′, where s, s′ ∈ S. The probability that p is present in A|(Si) is the probability that
both s and s′ are in Si. This probability is
α = C(n − 2, i − 2) / C(n, i) = [ (n − 2)! / ((i − 2)! (n − i)!) ] · [ i! (n − i)! / n! ] = i(i − 1) / (n(n − 1)).
°
In this case, no two intersection points are the same, no two intersection points (or vertices) have the same x-coordinate, no
two segments lie on the same line, etc. Making geometric algorithms work correctly for all degenerate inputs is a huge pain that can
usually be handled by tedious and careful handling. Thus, we will always assume general position of the input. In other words, in
theory all geometric inputs are inherently good, while in practice they are all evil (as anybody that tried to implement geometric
algorithms can testify). The reader is encouraged not to use this to draw any conclusions on the human condition.
The proof is provided in excruciating detail to get the reader used to this kind of argumentation. I would apologize for this
pain, but it is a minor trifle, not to be mentioned, when compared to the other crimes in this manuscript.
In particular, the expected number of intersection points of A(S) that appear in A(Si) is Σ_{p ∈ V} α = α k, where V is the set of k intersection points of A(S). Since every segment of Si contributes its two endpoints to
the arrangement A(Si), we have that the expected number of vertices in A(Si) is
2i + ( i(i − 1) / (n(n − 1)) ) k.
Now, the number of trapezoids in A|(Si) is proportional to the number of vertices of A(Si), which implies the claim.
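The probability computed in the proof is easy to confirm empirically; the short check below (added for illustration) compares the empirical frequency with i(i − 1)/(n(n − 1)) and with the binomial expression it was derived from.

    import random
    from math import comb

    def pair_in_sample_prob(n, i, trials=100_000, seed=0):
        """Empirical probability that two fixed elements both land in a random i-subset of n."""
        rng = random.Random(seed)
        universe = list(range(n))
        hits = 0
        for _ in range(trials):
            sample = set(rng.sample(universe, i))
            hits += (0 in sample and 1 in sample)
        return hits / trials

    n, i = 50, 10
    print(pair_in_sample_prob(n, i))              # empirical estimate
    print(i * (i - 1) / (n * (n - 1)))            # i(i-1)/(n(n-1)) from the proof
    print(comb(n - 2, i - 2) / comb(n, i))        # the binomial form, same value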
As such, the expected overall size of the conflict lists created in the ith iteration is
E[ Ci | Bi ] ≤ Σ_{σ ∈ Bi} (4/i) |cl(σ)| = (4/i) Wi,
where Ci denotes the total size of the conflict lists created in the ith iteration, Wi = Σ_{σ ∈ Bi} |cl(σ)|, and the inequality holds since (by backward analysis) a trapezoid σ ∈ Bi was created in the ith iteration only if si is one of the at most four segments defining it, which happens with probability at most 4/i.
By Lemma 6.1.2, the expected size of Bi is O(i + ki²/n²). Let us guess (for the time being) that on average the size of the conflict
list of a trapezoid of Bi is about O(n/i). In particular, assume that we know that
E[Wi] = O( ( i + (i²/n²) k ) · (n/i) ) = O( n + k i/n ),
by Lemma 6.1.2. Implying
E[Ci] = E[ E[Ci | Bi] ] ≤ E[ (4/i) Wi ] = (4/i) E[Wi] = O( (4/i)( n + ki/n ) ) = O( n/i + k/n ).
In particular, the expected amount of work in the ith iteration is proportional to E[Ci]. Thus, the overall expected running time of
the algorithm is
E[ Σ_{i=1}^n Ci ] = O( Σ_{i=1}^n ( n/i + k/n ) ) = O( n log n + k ).
Theorem 6.1.3 Given a set S of n segments in the plane with k intersections, one can compute the vertical decomposition of A(S)
in expected O(n log n + k) time.
Intuition and discussion. What remains to be seen, is how we came up with the guess that the average size of a conflict-list
of a trapezoid of Bi is about O(n/i). Note, that ε-nets imply that the bound O((n/i) log i) holds with constant confidence (see
Theorem 5.3.4), so this result is only slightly surprising. To prove this, we present in the next section a “strengthening” of ε-nets to
geometric settings.
To get an intuition how we came up with this guess, consider a set P of n points on the line, and a random sample R of i points
from P. Let Î be the partition of the real line into maximal open intervals, induced by the points of R, that do not contain points of R in their
interior.
Consider an interval (i.e., a one-dimensional trapezoid) of Î. It is intuitively clear that this interval (in expectation) would
contain O(n/i) points. Indeed, fix a point x on the real line, and imagine that we pick each point of P into the random
sample with probability i/n. The random variable which is the number of points of P we have to scan to the right of x till we “hit” a point that is in the
random sample behaves like a geometric variable with probability i/n, and as such its expected value is n/i. The same argument
works if we scan P to the left of x. We conclude that the number of points of P in the interval of Î that contains x, but does not
contain any point of R, is O(n/i) in expectation.
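This one-dimensional intuition is also easy to test numerically. The sketch below (an added illustration) samples i out of n random points and counts how many points of P fall in the gap of the sample containing x = 1/2; since the gap extends to both sides of x, the answer concentrates around 2n/i, i.e., Θ(n/i) as claimed.

    import random

    def avg_gap_size(n, i, x=0.5, trials=2000, seed=0):
        """Average number of points of P in the interval of the sample R containing x."""
        rng = random.Random(seed)
        total = 0
        for _ in range(trials):
            P = [rng.random() for _ in range(n)]
            R = rng.sample(P, i)
            lo = max((p for p in R if p <= x), default=float("-inf"))
            hi = min((p for p in R if p > x), default=float("inf"))
            total += sum(1 for p in P if lo < p < hi)
        return total / trials

    n, i = 1000, 50
    print(avg_gap_size(n, i), "vs n/i =", n / i)   # about 2n/i, i.e. Theta(n/i)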
Of course, the vertical decomposition case is more involved. Instead of proving the required result for this case, we will prove
a more general result which can be applied in a lot of other settings.
6.2 General Settings
Let S be a set of objects. For a subset R ⊆ S, we define a collection of ‘regions’ called F (R). For vertical decomposition of
segments (i.e., Theorem 6.1.3), the objects are segments, the regions are trapezoids, and F (R) is the set of vertical trapezoids
forming A| (R). Let [
T = T (S) = F ( R)
R⊆S
denote the set of all possible regions defined by subsets of S. We associate two subsets D(σ), K(σ) ⊆ S with each region σ ∈ T .
®
The defining set D(σ) of σ is a subset of S defining the region σ (the precise a
requirements from this set are specified in the axioms below). We assume that for
c
every σ ∈ T , |D(σ)| ≤ d for a (small) constant d. The constant d is sometime referred
to as the combinatorial dimension. In the case of Theorem 6.1.3, each trapezoid σ e
is defined by at most 4 segments (or lines) of S that define the region covered by the σ
trapezoid σ, and this set of segments is D(σ). See figure on the right.
The killing set K(σ) of σ is the set of objects of S such that including any object d b
of K(∆) into R prevents σ from appearing in F (R) (i.e., the killing set is the conflict
f
list of σ, if σ is being created by the RIC algorithm). In many applications K(σ) is
just the set of objects intersecting the cell σ; this is also the case in Theorem 6.1.3, Figure 6.1: D(σ) = {b, c, d, e}
where K(σ) is the set of segments of S intersecting the interior of the trapezoid σ, see
Figure 6.1. The weight of σ is ω(σ) = |K(σ)|.
and K(σ) = { f }.
Let S, F (R) , D(σ), and K(σ) be such that for any subset R ⊆ S, the set F (R) satisfies the following axioms:
(i) For any σ ∈ F (R), it holds D(σ) ⊆ R and R ∩ K(σ) = ∅.
(ii) If D(σ) ⊆ R and K(σ) ∩ R = ∅, then σ ∈ F (R).
For any natural number r and a number t > 0, consider R to be a random sample of size r from S, and let
Ft(R) = { σ ∈ F(R) | ω(σ) ≥ t · n/r }.
This is the set of regions in F(R) with a weight that is t times larger than what we expect.¯ We intuitively expect the size of this
set to drop fast as t increases. So, let
#(r) = E[ |F(R)| ]   and   #t(r) = E[ |Ft(R)| ],
where the expectation is over random subsets R ⊆ S of size r. Note that #(r) = #0(r) is just the expected number of regions defined by a
random sample of size r. In words, #t(r) is the expected number of regions in a structure created by r random objects, such that
these regions have weight which is t times larger than the “expected” weight of n/r.
Let
Tt(r) = ∪_{R ⊆ S, |R| = r} Ft(R)
denote the set of all t-heavy regions that might be created by a sample of size r.
In the following, S is a set of n objects complying with Axioms (i) and (ii), and d = max_{σ ∈ T(S)} |D(σ)|.
Lemma 6.2.1 Let r ≤ n and t be parameters, such that 1 ≤ t ≤ r/d. Furthermore, let R be a sample of size r and R′ a sample of size
r′ = ⌊r/t⌋, both from S. Let σ ∈ T be a trapezoid with weight ω = ω(σ) ≥ t(n/r). Then, Pr[σ ∈ F(R)] = O( t^d 2^{−t} Pr[σ ∈ F(R′)] ).
Intuitively (but not quite correctly), Lemma 6.2.1 states that the probability that a t-heavy trapezoid is created drops exponentially with t.
An Almost Proof of Lemma 6.2.1. We provide a back of the envelope argument “proving” Lemma 6.2.1. A more formal proof is
provided in Section 6.5.
Let us pick R (resp., R′) by picking each element of S with probability p = r/n (resp., p′ = r′/n). Note that this sampling
is different from the one used by the lemma, but it provides samples having roughly the same size, and we expect the relevant
®
Paraphrasing Voltaire, this does not imply that a member of T lives in the best of all possible sets.
¯
These are the regions that are t times overweight. Speak about an obesity problem.
probabilities to remain roughly the same. Let δ = |D(σ)| and ω = ω(σ). We have that Pr[σ ∈ F(R)] = p^δ (1 − p)^ω (i.e., this is the
probability that we pick the elements of D(σ) into the sample and do not pick any of the elements of K(σ), which is
by Axiom (ii) exactly the event σ ∈ F(R)). Similarly, Pr[σ ∈ F(R′)] = p′^δ (1 − p′)^ω. As such,
α = Pr[σ ∈ F(R)] / Pr[σ ∈ F(R′)] = [ p^δ (1 − p)^ω ] / [ p′^δ (1 − p′)^ω ] = (r/r′)^δ ( (1 − r/n)/(1 − r′/n) )^ω ≤ (t + 1)^d ( (n − r)/(n − r′) )^ω.
Now,
( (n − r)/(n − r′) )^ω = ( 1 + (r′ − r)/(n − r′) )^ω ≤ ( 1 + (r/t − r)/(n − r′) )^ω ≤ ( 1 − (1 − 1/t)(r/n) )^ω ≤ exp( −(1 − 1/t)(r/n) ω )
≤ exp( −(1 − 1/t)(r/n) t(n/r) ) = exp( −(t − 1) ),
implying that α = O( t^d 2^{−t} ), as claimed.
Since the formal proof is less enlightening than the above “almost proof”, we delegate it to the end of the chapter, see Sec-
tion 6.5. The following exponential decay lemma testifies that truly heavy regions are (exponentially) rare.
Lemma 6.2.2 Given a set S of n objects, let r ≤ n and t be parameters, such that 1 ≤ t ≤ r/d, where d = max_{σ ∈ T(S)} |D(σ)|.
Assuming that Axioms (i) and (ii) above hold for any subset of S, we have
#t(r) = O( t^d 2^{−t} #(r/t) ) = O( t^d 2^{−t} #(r) ).    (6.1)
Proof: Let R be a random sample of size r from S and R′ a random sample of size r′ = ⌊r/t⌋ from S. Let Tt = Tt(r). We
have
#t(r) = E[ |Ft(R)| ] = Σ_{σ ∈ Tt} Pr[σ ∈ F(R)] = Σ_{σ ∈ Tt} O( t^d 2^{−t} Pr[σ ∈ F(R′)] )
≤ O( t^d 2^{−t} ) Σ_{σ ∈ T} Pr[σ ∈ F(R′)] = O( t^d 2^{−t} #(r′) ),
by Lemma 6.2.1.
Theorem 6.2.3 Let R ⊆ S be a random subset of size r. Let #(r) = E[ |F(R)| ] and let c ≥ 1 be an arbitrary constant. Then,
E[ Σ_{σ ∈ F(R)} (ω(σ))^c ] = O( (n/r)^c #(r) ).
6.3 Applications
6.3.1 Analyzing the RIC Algorithm for Vertical Decomposition
As shown in Lemma 6.1.2, #(i) = O( i + k(i/n)² ). Thus, by Theorem 6.2.3, we have that
E[ Σ_{σ ∈ Bi} ω(σ) ] = O( (n/i) τ(i) ) = O( n + ki/n ).
6.3.2 Cuttings
Let S be a set of n lines in the plane, and let r be an arbitrary parameter. A (1/r)-cutting of S is a partition of the plane into constant-complexity regions such that each region intersects at most n/r lines of S. It is natural to try and minimize the number of regions in
the cutting, as cuttings are a natural tool for performing divide and conquer.
Consider the range space having S as its ground set, and vertical trapezoids as its ranges (i.e., given a vertical trapezoid σ, its
corresponding range is the set of all lines of S that intersect the interior of σ). This range space has a VC-dimension which is a
constant, as can be easily verified. Let X ⊆ S be an ε-net for this range space, for ε = 1/r. By Theorem 5.3.4 (ε-net theorem), there
exists such an ε-net X, of this range space, of size O((1/ε) log(1/ε)) = O(r log r). Consider a vertical trapezoid σ in the arrangement
A|(X). It does not intersect any of the lines of X in its interior, and X is an ε-net for S. It follows that σ intersects at most εn = n/r
lines of S in its interior. Since the arrangement A|(X) has complexity O(|X|²), we get the following result.
Lemma 6.3.1 There exists a (1/r)-cutting of a set of segments in the plane of size O( (r log r)² ).
Since an arrangement of n lines has at most C(n, 2) intersection points, and the number of intersections of the lines intersecting a single
region in the cutting is at most C(n/r, 2), this implies that any cutting must be of size Ω(r²). We can get cuttings of such size easily using
the moments technique.
Theorem 6.3.2 Let S be a set of n lines in the plane, and let r be a parameter. One can compute a (1/r)-cutting of S of size O(r2 ).
Proof: Let R be a random sample of size r, and consider its vertical decomposition A|(R). If a vertical trapezoid σ ∈ A|(R)
intersects at most n/r lines of S, then we can add it to the output cutting. The other possibility is that σ intersects t(n/r) lines of
S, for some t > 1; let cl(σ) ⊂ S be the conflict list of σ (i.e., the list of lines of S that intersect the interior of σ). Clearly, a
(1/t)-cutting for the set cl(σ) forms a vertical decomposition (clipped inside σ) such that each trapezoid in this cutting intersects at
most n/r lines of S. Thus, we compute such a cutting inside each such “heavy” trapezoid using the algorithm of Lemma 6.3.1, and
add these subtrapezoids to the resulting cutting. Clearly, the size of the resulting cutting inside σ is O( t² log² t ) = O( t⁴ ). The resulting
two-level partition is clearly the required cutting. By Theorem 6.2.3, the expected size of the cutting is
O( #(r) ) + E[ Σ_{σ ∈ F(R)} O( ( ω(σ) / (n/r) )⁴ ) ] = O( #(r) ) + O( (r/n)⁴ ) · E[ Σ_{σ ∈ F(R)} (ω(σ))⁴ ]
= O( #(r) + (r/n)⁴ · (n/r)⁴ · #(r) ) = O( #(r) ) = O( r² ),
6.4 Bibliographical notes

The technique also applies in more general settings, for example to the vertical decomposition of a single face in an arrangement. Here an insertion of a faraway segment
into the random sample might cut off a portion of the face of interest. In particular, in the settings of Agarwal et al., Axiom (ii) is replaced by
Interestingly, Clarkson [Cla88] did not prove Theorem 6.2.3 using the exponential decay lemma, but gave a direct proof.
His proof, however, implicitly contains the exponential decay lemma. We chose the current exposition since it is technically only
slightly more challenging but provides a better intuition of what is really going on.
The exponential decay lemma (Lemma 6.2.2), was proved by Chazelle and Friedman [CF90]. The work of [AMS94] is a
further extension of this result. Another analysis was provided by Clarkson et al. [CMS93].
Another way to reach similar results, is using the technique of Mulmuley [Mul94a], which relies on a direct analysis on
‘stoppers’ and ‘triggers’. This technique is somewhat less convenient to use but is applicable to some settings where the moments
technique does not apply directly. Mulmuley came up with randomized incremental construction of vertical decomposition. Also,
his concept of the omega function might explain why randomized incremental algorithms perform better in practice than their worst
case analysis [Mul94b].
Backwards analysis in geometric settings was first used by Chew [Che86], and formalized by Seidel [Sei93]. It is similar to the
“leave one out” argument used in statistics for cross validation. The basic idea was probably known to the Greeks (or Russians or
French) at some point in time.
(Naturally, our summary of the development is cursory at best and not necessarily accurate, and all possible disclaimers apply.
A good summary is provided in the introduction of [Sei93].)
Sampling model. Our “almost” proof of Lemma 6.2.1 used a different sampling than the one used by the algorithm (i.e.,
sampling without replacement). Furthermore, Clarkson [Cla88] used random sampling with replacement. As a rule of thumb, all
these sampling approaches are similar and yield similar results, and it is a good idea to use whichever sampling scheme is the easiest
to analyze when figuring out what is going on. Of course, a formal proof requires analyzing the algorithm in the sampling model it
actually uses.
Lazy randomized incremental construction. If one wants to compute a single face that contains a marked point in
an arrangement of curves, then the problem in using randomized incremental construction is that as you add curves, the region of
interest shrinks, and regions that were maintained should be ignored. One option is to perform flooding in the vertical decomposition
to figure out which trapezoids are still reachable from the marked point, and to maintain only these trapezoids in the conflict graph.
Doing it in each iteration is way too expensive, but luckily one can use a lazy strategy that performs this cleanup only a logarithmic
number of times (i.e., you perform a cleanup in an iteration if the iteration number is, say, a power of 2). This strategy complicates
the analysis a bit; see [dBDS95] for more details on this lazy randomized incremental construction technique. An alternative
technique was suggested by the author for the (more restricted) case of planar arrangements, see [Har00b]. The idea is to compute
only what the algorithm really needs in order to compute the output, by computing the vertical decomposition in an exploratory online
fashion. The details are unfortunately overwhelming, although the algorithm seems to perform quite well in practice.
Cuttings. The concept of cuttings was introduced by Clarkson. The first optimal-size cuttings were constructed by Chazelle and
Friedman [CF90], who proved the exponential decay lemma to this end. Our elegant proof follows the presentation by de Berg
and Schwarzkopf [dBS95]. The problem with this approach is that the constants involved in the cutting size are awful.° Matoušek
[Mat98] showed that there are (1/r)-cuttings with 8r² + 6r + 4 trapezoids, by using level approximation. A different approach was
taken by the author [Har00a], who showed how to get cuttings which seem to be quite small (i.e., constant-wise) in practice. The
basic idea is to do randomized incremental construction, but at each iteration greedily add all the trapezoids with a small enough conflict list to the output cutting. One can prove that this algorithm also generates O(r²) cuttings, but the details are not trivial, as the
framework described in this chapter is not applicable for analyzing this algorithm.
Cuttings can also be computed in higher dimensions for hyperplanes, and in the plane for well-behaved curves, see [SA95].
Even more on randomized algorithms in geometry. We have only scratched the surface of this fascinating topic,
which is one of the cornerstones of “modern” computational geometry. The interested reader should have a look at the books by
Mulmuley [Mul94a], Sharir and Agarwal [SA95], Matoušek [Mat02] and Boissonnat and Yvinec [BY98].
°
This is why all computations related to cuttings should be done on waiter’s bill pad. As Douglas Adams put it: “On a waiter’s
bill pad, reality and unreality collide on such a fundamental level that each becomes the other and anything is possible, within
certain parameters.”
6.5 Proof of Lemma 6.2.1
Proof of Lemma 6.2.1: Let Eσ be the event that D(σ) ⊆ R and K(σ) ∩ R = ∅. Similarly, let E′σ be the event that D(σ) ⊆ R′ and
K(σ) ∩ R′ = ∅. By the axioms, we have that σ ∈ F(R) (resp., σ ∈ F(R′)) if and only if Eσ (resp., E′σ) happens.
The proof of this lemma is somewhat tedious and follows by careful calculations. Let δ = |D(σ)| ≤ d and ω = ω(σ), and for two
non-negative integers a ≤ x, let (x)_a = x(x − 1) · · · (x − a + 1) denote the falling factorial. Then
Pr[Eσ] / Pr[E′σ] = [ C(n − ω − δ, r − δ) / C(n, r) ] · [ C(n, r′) / C(n − ω − δ, r′ − δ) ]
= [ (n − r)! r! / ( (n − r′)! r′! ) ] · [ (n − ω − r′)! (r′ − δ)! / ( (n − ω − r)! (r − δ)! ) ]
= [ (r)_δ / (r′)_δ ] · [ (n − ω − r′)_{r−r′} / (n − r′)_{r−r′} ]
≤ [ (r)_d / (r′)_d ] · [ (n − ω − r′)_{r−r′} / (n − r′)_{r−r′} ].
By our assumption r′ = ⌊r/t⌋ ≥ d, so we obtain
(r)_d / (r′)_d ≤ ( (r − d + 1)/(r′ − d + 1) )^d ≤ ( r / (r′/d) )^d = O( ((t + 1)d)^d ) = O( t^d ),
since r′ − d + 1 ≥ r′/d and d is a constant. To bound the second factor, we observe that each of its r − r′ factors is at most
(n − ω − r′)/(n − r′), and therefore
(n − ω − r′)_{r−r′} / (n − r′)_{r−r′} ≤ ( (n − ω − r′)/(n − r′) )^{r−r′} = ( 1 − ω/(n − r′) )^{r−r′} ≤ exp( −ω(r − r′)/(n − r′) ) ≤ exp( −ω(r − r′)/n ).
Since ω ≥ t(n/r), we have ω/n ≥ t/r, and since r − r′ ≥ r(1 − 1/t), we therefore get
Pr[Eσ] / Pr[E′σ] = O( t^d exp( −ω(r − r′)/n ) ) = O( t^d exp( −(t/r) · r(1 − 1/t) ) ) = O( t^d ) exp( −(t − 1) ) = O( t^d 2^{−t} ),
as desired.
Chapter 7
“Maybe the Nazis told the truth about us. Maybe the Nazis were the truth. We shouldn’t forget that: perhaps they
were the truth. The rest, just beautiful lies. We’ve sung many beautiful lies about ourselves. Perhaps that’s what I’m
trying to do - to sing another beautiful lie.”
– –The roots of heaven, Romain Gary
In this chapter, we introduce a “trivial” but yet powerful idea. Given a set S of objects, consider a point p that is contained in some
of the objects, and let its weight be the number of objects that contain it. We can estimate the depth/weight of p by counting the
number of objects that contain it in a random sample of the objects. In fact, by considering points induced by the sample, we can
bound the number of “light” vertices induced by S. This idea can be extended to bounding the number of “light” configurations
induced by a set of objects.
This approach leads to a sequence of short, beautiful, elegant and correct± proofs of several hallmark results in discrete
geometry.
While the results in this chapter are not directly related to approximation algorithms, the insights and general approach would
be useful for us later, or so one hopes.
± The saying goes that “hard theorems have short, elegant and incorrect proofs”. This chapter can maybe serve as a counterexample.
since j ≤ k and 1 − x ≥ e^{−2x}, for 0 < x < 1/2.
On the other hand, the number of vertices on the 0-level of R is |R| − 1. As such,
Σ_{p ∈ V≤k} Xp ≤ |R| − 1.
Putting these two inequalities together, we get that |V≤k| / (e² k²) ≤ n/k. Namely, |V≤k| ≤ e² n k.
The connection to depth is simple. Every line defines a halfplane (i.e., the region above the line). A vertex of depth at most k
is contained in at most k halfplanes. The above proof (intuitively) first observed that there are at most n/k vertices of the random
sample of zero depth (i.e., on the 0-level), and then showed that every vertex of level at most k has probability (roughly) 1/k² to have depth zero in the
random sample. It thus follows that if the number of vertices of level at most k is µ, then µ/k² ≤ n/k; namely, µ = O(nk).
Theorem 7.2.1 (Euler’s formula.) For a connected planar graph G, we have f − e + v = 2, where f, e, v are the number of faces,
edges and vertices in a planar drawing of G.
Lemma 7.2.2 A simple planar graph G with v ≥ 3 vertices has at most e ≤ 3v − 6 edges.
Proof: We assume that the number of edges of G is maximal (i.e., no edge can be added without introducing a crossing); if it
is not maximal, we add edges until it becomes maximal, which only helps. This implies that G is a triangulation (i.e., every face is a triangle). Then
every face is adjacent to three edges, and every edge is adjacent to two faces, and as such 2e = 3f. By Euler’s formula, we have f − e + v = (2/3)e − e + v = 2. Namely,
−e + 3v = 6; alternatively, e = 3v − 6. If e was not maximal to begin with, this equality deteriorates to the required inequality.
For example, the above inequality implies that the complete graph over 5 vertices (i.e., K5) is not planar. Indeed, it has
e = C(5, 2) = 10 edges and v = 5 vertices, but if it were planar, the above inequality would imply that 10 = e ≤ 3v − 6 = 9, which is of
course false. (The reader can amuse herself by trying to prove that K3,3, the bipartite complete graph with 3 vertices on each side,
is not planar.)
Kuratowski’s celebrated theorem states that a graph is planar if and only if it does not contain either K5 or K3,3 inside
it (formally, if it does not have K5 or K3,3 as a minor).
For a graph G, we define the crossing number of G, denoted by c(G), as the minimal number of edge crossings in any drawing
of G in the plane. For a planar graph c(G) is zero, and it is “larger” for “less planar” graphs.
Claim 7.2.3 For a graph G with e edges and v vertices, we have c(G) ≥ e − 3v + 6.
Proof: If e − 3v + 6 ≤ 0 ≤ c(G), then the claim holds trivially. Otherwise, the graph G is not planar by Lemma 7.2.2. Draw G in
such a way that c(G) is realized and assume, for the sake of contradiction, that c(G) < e − 3v + 6. Let H be the graph resulting from
G by removing, from each pair of edges of G that intersect in the drawing, one of the two edges. We have e(H) ≥ e(G) − c(G). But H is
planar (since its drawing has no crossings), and by Lemma 7.2.2 we have e(H) ≤ 3v(H) − 6, or equivalently, e(G) − c(G) ≤ 3v − 6.
Namely, e − 3v + 6 ≤ c(G), which contradicts our assumption.
Lemma 7.2.4 (The crossing lemma.) For a graph G such that e ≥ 4v, we have c(G) = Ω(e³/v²).
Proof: We consider a specific drawing D of G in the plane that has c(G) crossings. Next, let U be a random subset of V selected
by choosing each vertex of V to be in the sample with probability p > 0, where p is a parameter to be determined shortly.
Let H = G_U be the induced subgraph over U. Note that only edges of G with both their endpoints in U “survive” in H.
Thus, the probability of a vertex to survive in H is p, the probability of an edge of G to survive in H is p², and the probability
of a crossing (in this specific drawing D) to survive in the induced drawing D_H (of H) is p⁴. Let Xv and Xe denote the (random
variables which are the) numbers of vertices and edges surviving in H, respectively. Similarly, let Xc be the number of crossings
surviving in D_H. By Claim 7.2.3, we have
Xc ≥ c(H) ≥ Xe − 3Xv + 6.
In particular, this holds in expectation, and as such
p⁴ c(G) = E[Xc] ≥ E[Xe] − 3 E[Xv] + 6 ≥ p² e − 3p v.
Dividing by p⁴ and setting p = 4v/e (which is at most 1 since e ≥ 4v) yields c(G) ≥ e/p² − 3v/p³ = e³/(16v²) − 3e³/(64v²) = e³/(64v²) = Ω(e³/v²).
Lemma 7.2.5 The maximum number of incidences between n points and m lines is $I(n, m) = O\big(n^{2/3} m^{2/3} + n + m\big)$.
Proof: Let P and L be the set of n points and the set of m lines, respectively, realizing I(n, m). Let G be a graph over the points of
P (we assume that P contains an additional point at infinity). We connect two points if they lie consecutively on a common line of
L. Clearly, e = e(G) = I + m and v = v(G) = n + 1, where I = I(n, m). We can interpret the arrangement of lines A(L) as a
drawing of G, where a crossing of two edges of G is just a vertex of A(L). As such, it follows that c(G) ≤ m², since m² is a trivial
bound on the number of vertices of A(L). On the other hand, by Lemma 7.2.4, we have c(G) = Ω(e³/v²). Thus,
$$\frac{(I + m)^3}{(n + 1)^2} = \frac{e^3}{v^2} = O(c(G)) = O(m^2).$$
Assuming I ≥ m and I ≥ n, we have $I = O\big(m^{2/3} n^{2/3}\big)$. Alternatively, $I = O\big(n^{2/3} m^{2/3} + m + n\big)$.
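To spell out the last algebraic step (an elementary manipulation, not spelled out in the text): since $(I + m)^3 \ge I^3$ and $(n + 1)^2 \le 4n^2$, the display above gives
$$\frac{I^3}{4 n^2} \le \frac{(I+m)^3}{(n+1)^2} = O\big(m^2\big),$$
so $I^3 = O\big(m^2 n^2\big)$ and therefore $I = O\big(m^{2/3} n^{2/3}\big)$. When I < m or I < n, the additive m + n term in the final bound dominates.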
Or invented – I have no dog in this argument.
that f₋(y) = k − 1, which implies that the line passing through p with slope y has a point s ∈ P on it, and s is to the right of p.
Clearly, if we continue sweeping, the line would sweep over rp, which implies the claim.
Lemma 7.2.6 also holds by symmetry in the other direction: Between any two edges to the right of p, there is an antipodal
edge on the other side.
Lemma 7.2.7 Let p be a point of P, and let q be a point to its left, such that qp ∈ E(G) and it has the largest slope among all such
edges. Furthermore, assume that there are k − 1 points of P to the right of p. Then, there exists a point r ∈ P, such that pr ∈ E(G)
and pr has larger slope than qp.
Proof: Let α be the slope of qp, and observe that f(α) = k, f₊(α) = k − 1, and f(∞) ≥ k. Namely, there exists y > α such
that f(y) = k. We conclude that there is a k-set edge adjacent to p on the right, with slope larger than α.
So, imagine that we are at an edge e = qp ∈ E(G), where q is to the left of p. We
rotate a line around p (counterclockwise) till we encounter an edge e′ = pr ∈ E(G), where
r is a point to the right of p. We can now walk from e to e′, and continue walking in this
way, forming a chain of edges in G. Note that, by Lemma 7.2.6, no two such chains can
end up using the same edge. Furthermore, by Lemma 7.2.7, such a chain can
end only in the last k − 1 points of P (in their ordering along the x-axis). Namely, we
decomposed the edges of G into k − 1 edge-disjoint convex chains (the chains are convex
since we rotate counterclockwise as we walk along a chain). The picture on the right shows the
5-sets and their decomposition into 4 convex chains.
Lemma 7.2.8 The edges of G can be decomposed into k − 1 convex chains C1 , . . . , Ck−1 .
Similarly, the edges of G can be decomposed into m = n − k + 1 concave chains
D1 , . . . , Dm .
Proof: The first part of the claim is proved above. As for the second claim, rotate the plane
by 180◦ . Every k-set is now (n − k + 2)-set, and by the above argumentation, the edges
of G can be decomposed into n − k + 1 convex chains, which are concave in the original
orientation.
Theorem 7.2.9 The number of k-sets defined by a set of n points in the plane is $O\big(n k^{1/3}\big)$.
Proof: The graph G has n = |P| vertices, and let m = |E(G)| be the number of k-sets. By Lemma 7.2.8, any crossing of two
edges of G is an intersection point of one of the convex chains C1, . . . , C_{k−1} with one of the concave chains D1, . . . , D_{n−k+1}. Since a convex
chain and a concave chain can have at most two intersections, we conclude
that there are at most 2(k − 1)(n − k + 1) crossings in G.
By the Crossing Lemma (Lemma 7.2.4), there are at least Ω(m³/n²) crossings. Putting these two inequalities together, we conclude
that m³/n² = O(nk), which implies $m = O\big(n k^{1/3}\big)$.
Theorem 7.3.1 Let S be a set of n objects as above, with combinatorial dimension d, and let k be a parameter. Let R be a random
sample created by picking each element of S with probability 1/k. Then, we have
$$\bigl|T_{\le k}(S)\bigr| \le c\, \mathbb{E}\bigl[\, k^d f(|R|)\, \bigr],$$
for a constant c.
Proof: We reproduce the proof of Theorem 7.1.1. Every region $\sigma \in T_{\le k}$ appears in $F(R)$ with probability $\ge (1/k)^d (1 - 1/k)^k \ge e^{-2}/k^d$.
As such, $\mathbb{E}\bigl[f(|R|)\bigr] \ge \mathbb{E}\bigl[|F(R)|\bigr] \ge |T_{\le k}| \big/ \bigl(k^d e^2\bigr)$.
Lemma 7.3.2 Let f(·) be a monotone increasing function which is well behaved; namely, there exists a constant c such that
f(xr) ≤ c f(r), for any r and 1 ≤ x ≤ 2. Let Y be the number of heads in n coin-flips where the probability for head is 1/k. Then
$\mathbb{E}[f(Y)] = O(f(n/k))$.
Proof: We have
$$\mathbb{E}[f(Y)] \;\le\; f\bigl(10(n/k)\bigr) + \sum_{t=10}^{k} f\bigl((t+1)(n/k)\bigr)\Pr\bigl[Y \ge t(n/k)\bigr] \;\le\; O\bigl(f(n/k)\bigr) + \sum_{t=10}^{k} c^{\lceil \lg (t+1)\rceil} f(n/k)\, 2^{-t(n/k)} \;=\; O\bigl(f(n/k)\bigr),$$
by the Chernoff inequality and since f(·) is well behaved.
Theorem 7.3.3 Let S be a set of n objects, with combinatorial dimension d, and let k be a parameter. Assume that the number
of regions formed by a set of m objects is bounded by a function f (m), and furthermore, f (m) is well behaved in the sense of
Lemma 7.3.2. Then, $|T_{\le k}(S)| = O\bigl(k^d f(n/k)\bigr)$.
Note, that if the function f (·) grows polynomially then Theorem 7.3.3 applies. It fails if f (·) grows exponentially.
We need the following fact, which we state without proof.
Theorem 7.3.4 (The Upper Bound Theorem.) The complexity of the convex hull of n points in d dimensions is bounded by $O\bigl(n^{\lfloor d/2 \rfloor}\bigr)$.
Example 7.3.5 (At most k-sets.) Let P be a set of n points in IR^d. A region here is a halfspace with d points on its boundary. The
set of regions defined by P is just the set of faces of the convex hull of P. The complexity of the convex hull of n points in d dimensions is
$f(n) = O\bigl(n^{\lfloor d/2\rfloor}\bigr)$, by Theorem 7.3.4. Two halfspaces h, h′ are considered to be combinatorially different if P ∩ h ≠ P ∩ h′. As
such, the number of combinatorially different halfspaces containing at most k points of P is at most $O\bigl(k^d f(n/k)\bigr) = O\bigl(k^{\lceil d/2\rceil} n^{\lfloor d/2\rfloor}\bigr)$.
At most k-level. The technique for bounding the complexity of the at most k-level (or at most depth k) is generally attributed to
Clarkson and Shor [CS89], and more precisely it is from [Cla88]. Previous work on just the two-dimensional variant includes [GP84,
Wel86, AG86]. Our presentation in Section 7.1 and Section 7.3 follows (more or less) Sharir [Sha03]. The connection of this
technique to the crossing lemma is from there.
For a proof of the Upper Bound Theorem (Theorem 7.3.4), see Matoušek [Mat02].
The crossing lemma. The crossing lemma is originally by Ajtai et al. [ACNS82] and Leighton [Lei84]. The current greatly
simplified “proof from the book” is attributed to Sharir. The insight that this lemma has something to do with incidences and similar
problems is due to Székely [Szé97]. Elekes [Ele97] used the crossing lemma to prove surprising lower bounds on sum and products
problems.
The complexity of the k-level and the number of k-sets. This is considered to be one of the hardest problems in discrete
geometry, and there is still a big gap between the best lower bound [Tót01] and the best upper bound currently known [Dey98]. Our
presentation in Section 7.2.2 follows suggestions by Micha Sharir, and is based on the result of Dey [Dey98] (which was in turn
inspired by the work of Agarwal et al. [AACS98]). This problem has a long history, and the reader is referred to Dey [Dey98] for
details.
Incidences. This problem again has a long and painful history. The reader is referred to [Szé97] for details.
We only skimmed the surface of some problems in discrete geometry and results known in this field related to incidences and
k-sets. Good starting points for learning more are the books by Brass et al. [BMP05] and Matoušek [Mat02].
Chapter 8
As far as he personally was concerned there was nothing else for him to do except either shoot himself as soon
as he came home or send for his greatcoat and saber from the general’s apartment, take a bath in the town baths,
stop at Volgruber’s wine-cellar afterwards, put his appetite in order again and book by telephone a ticket for the
performance in the town theater that evening.
– – The good soldier Svejk, Jaroslav Hasek
The data structure. Let R1, . . . , R_M be M independent random samples of S, formed by picking every element with probability 1/z, where
$$M = \nu(\varepsilon) = \bigl\lceil c_2\, \varepsilon^{-2} \log n \bigr\rceil,$$
and c2 is a sufficiently large absolute constant. Build M separate emptiness-query data structures D1, . . . , D_M, for the sets R1, . . . , R_M,
respectively, and put D = D(z, ε) = {D1, . . . , D_M}.
Answering a query. Consider a query range r, and let X_i = 1 if r intersects any of the objects of R_i and X_i = 0 otherwise, for
i = 1, . . . , M. The value of X_i can be determined using a single emptiness query in D_i, for i = 1, . . . , M. Compute $Y_r = \sum_i X_i$.
For a range σ of depth k, the probability that σ intersects one of the objects of R_i is
$$\rho(k) = 1 - \left(1 - \frac{1}{z}\right)^{k}. \tag{8.1}$$
If a range σ has depth z, then Λ = E[Y_σ] = ν(ε)ρ(z). Our data structure returns “depth(r, S) < z” if Y_r < Λ, and “depth(r, S) ≥ z”
otherwise.
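To make the construction concrete, here is a minimal sketch of this decision data structure in Python, assuming an abstract emptiness oracle intersects(objects, r) (a hypothetical primitive standing in for the emptiness-query data structures D_i) and an illustrative value for the constant c2.

import math
import random

class DepthDecider:
    # Decides (with high probability) whether depth(r, S) is below or above z.
    def __init__(self, objects, z, eps, c2=16):
        # M = nu(eps) independent samples, each element kept with probability 1/z.
        self.M = math.ceil(c2 * eps ** -2 * math.log(len(objects)))
        self.samples = [[o for o in objects if random.random() < 1.0 / z]
                        for _ in range(self.M)]
        rho_z = 1.0 - (1.0 - 1.0 / z) ** z        # Eq. (8.1) evaluated at depth z
        self.threshold = self.M * rho_z            # Lambda

    def is_deep(self, r, intersects):
        # Y_r = number of samples containing an object intersecting r;
        # each term costs one emptiness query.
        Y = sum(1 for R in self.samples if intersects(R, r))
        return Y >= self.threshold                 # report "depth(r, S) >= z" iff Y_r >= Lambda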
8.1.1.1 Correctness
In the following, we show that, with high probability, the data structure indeed returns the correct answer if the depth of the query
range is outside the “uncertainty” range [(1 − ε)z, (1 + ε)z]. For simplicity of exposition, we assume in the following that z ≥ 10 (the
case z < 10 follows by similar arguments). Consider a range r of depth at most (1 − ε)z. The data structure returns a wrong answer
if Yr > Λ. We will show that the probability of this event is polynomially small. The other case, where r has depth at least (1 + ε)z
but Yr < Λ is handled in a similar fashion.
Intuition. Before jumping into the murky proof, let us consider the situation. Every sample R_i is an experiment. The experiment
succeeds if the sample contains an object that intersects the query range r. The probability of success is ρ(k), see Eq. (8.1), where
k is the weight of r. Now, if there is a big enough gap between ρ((1 − ε)z) and ρ(z), then we could decide if the range is “heavy”
(i.e., weight exceeding z) or “light” (i.e., weight smaller than (1 − ε)z) by estimating the probability γ that r intersects an object in
the random sample.
Indeed, if r is “light” then γ ≤ ρ((1 − ε)z), and if it is “heavy” then γ ≥ ρ(z). We estimate γ by the quantity Y_r/M; namely, by
repeating the experiment M times, and dividing the number of successes by the number of experiments (i.e., M). Now, we need
to determine how many experiments we need to perform till we get a good estimate, which is reliable enough to carry out our
nefarious task of distinguishing the light case from the heavy case. Clearly, the bigger the gap between ρ((1 − ε)z) and ρ(z), the
fewer experiments are required. Our proof first establishes that the gap between these two probabilities is Ω(ε), and next we
plug this into the Chernoff inequality to figure out how large M has to be for this estimate to be reliable.
Observation 8.1.2 For x ∈ [0, 1/2], it holds exp(−2x) ≤ 1 − x. Similarly, for x ≥ 0, we have 1 − x ≤ exp(−x) and 1 + x ≤ exp(x).
We have
$$\mathbb{E}[Y_r] = M\,\rho\bigl((1-\varepsilon)z\bigr) = M\left(1 - \left(1 - \frac{1}{z}\right)^{(1-\varepsilon)z}\right) \ge M\left(1 - e^{-(1-\varepsilon)}\right) \ge \frac{M}{3},$$
since 1 − 1/z ≤ exp(−1/z) and ε ≤ 1/2. By definition, Λ = M ρ(z); therefore, by Observation 8.1.1, we have
$$\xi = \frac{\Lambda}{\mathbb{E}[Y_r]} = \frac{1 - \left(1 - \frac{1}{z}\right)^{z}}{1 - \left(1 - \frac{1}{z}\right)^{(1-\varepsilon)z}} \ge 1 + \left(1 - \frac{1}{z}\right)^{(1-\varepsilon)z} - \left(1 - \frac{1}{z}\right)^{z} = 1 + \left(1 - \frac{1}{z}\right)^{(1-\varepsilon)z}\left(1 - \left(1 - \frac{1}{z}\right)^{\varepsilon z}\right).$$
Now, by applying Observation 8.1.2 repeatedly, we have
$$\xi \ge 1 + \exp\left(-\frac{2}{z}(1-\varepsilon)z\right)\left(1 - \exp\left(-\frac{1}{z}\varepsilon z\right)\right) \ge 1 + \frac{1}{e^2}\bigl(1 - \exp(-\varepsilon)\bigr) \ge 1 + \frac{1}{e^2}\left(1 - \left(1 - \frac{\varepsilon}{2}\right)\right) \ge 1 + \frac{\varepsilon}{15}.$$
Deploying the Chernoff inequality (Theorem 25.2.6), we have that if µ_r = depth(r, S) = (1 − ε)z then
$$\alpha = \Pr[Y_r > \Lambda] \le \Pr\bigl[Y_r > \xi\, \mathbb{E}[Y_r]\bigr] \le \Pr\bigl[Y_r > (1 + \varepsilon/15)\, \mathbb{E}[Y_r]\bigr] \le \exp\left(-\frac{1}{4}\left(\frac{\varepsilon}{15}\right)^2 \mathbb{E}[Y_r]\right) \le \exp\left(-\frac{M\varepsilon^2}{c_3}\right) \le \exp\left(-\frac{\varepsilon^2 \bigl\lceil c_2\varepsilon^{-2}\log n\bigr\rceil}{c_3}\right) \le n^{-c_4},$$
for appropriate constants c3 and c4 (using E[Y_r] ≥ M/3).
Lemma 8.1.4 Given a set S of n objects, a parameter 0 < ε < 1/2, and z ∈ [0, n], one can construct a data structure D which,
given a range r, returns either “small” or “large”. If it returns “small”, then µ_r ≤ (1 + ε)z, and if it returns “large”, then µ_r ≥ (1 − ε)z. The data structure
might return either answer if µ_r ∈ [(1 − ε)z, (1 + ε)z].
The data structure D consists of M = O(ε⁻² log n) emptiness data structures. The space and preprocessing time needed to
build them are O(S(2n/z)ε⁻² log n), where S(m) is the space (and preprocessing time) needed for a single emptiness data structure
storing m objects.
The query time is O(Q(2n/z)ε⁻² log n), where Q(m) is the time needed for a single query in such a structure. All
bounds hold with high probability.
Proof: The lemma follows immediately from the above discussion. The only missing part is observing that by the Chernoff
inequality we have that |Ri | ≤ 2n/z, and this holds with high probability.
Answering a query. Given a range query r, each data structure in our list returns “small” or “large”. Moreover, with high probability, if
we were to query all the data structures, we would get a sequence of “large”s, followed by a sequence of “small”s. It is easy to verify that the
value associated with the last data structure returning “large” (rounded to the nearest integer) yields the required approximation. We can
use binary search on D(v1), . . . , D(v_W) to locate this changeover
value using a total of O(log W) = O(log(ε⁻¹ log n)) queries in the
structures D1, . . . , D_W. Namely, the overall query time is $O\bigl(Q(n)\varepsilon^{-2}(\log n)\log(\varepsilon^{-1}\log n)\bigr)$.
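A minimal sketch of this query procedure, assuming deciders built (as above) for an increasing sequence of values v_1 < · · · < v_W covering [1, n]; the names are illustrative.

def approximate_count(deciders, values, r, intersects):
    # deciders[i] answers "large" (True) iff depth(r, S) is (approximately) at least values[i];
    # binary search locates the changeover from "large" to "small".
    lo, hi, last_large = 0, len(deciders) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if deciders[mid].is_deep(r, intersects):
            last_large, lo = mid, mid + 1
        else:
            hi = mid - 1
    return round(values[last_large])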
Theorem 8.1.5 Given a set S of n objects, assume that one can construct, using S(n) space, in T(n) time, a data structure
that answers emptiness queries in Q(n) time.
Then, one can construct, using O(S(n)ε⁻³ log² n) space, in O(T(n)ε⁻³ log² n) time, a data structure that, given a range r,
outputs a number α_r with (1 − ε)µ_r ≤ α_r ≤ µ_r. The query time is O(ε⁻² Q(n)(log n) log(ε⁻¹ log n)). The result returned is correct
with high probability for all queries, and the running time bounds hold with high probability.
The bounds of Theorem 8.1.5 can be improved, see Section 8.4 for details.
8.2 Application: halfplane and halfspace range counting
Using the data structure of Dobkin and Kirkpatrick [DK85], one can answer emptiness halfspace range searching queries in loga-
rithmic time. In this case, we have S (n) = O(n), T (n) = O(n log n), and Q(n) = O(log n).
Corollary 8.2.1 Given a set P of n points in two (resp., three) dimensions, and a parameter ε > 0, one can construct in
O(n·poly(1/ε, log n)) time a data structure, of size O(n·poly(1/ε, log n)), such that given a halfplane (resp. halfspace) r, it out-
puts a number α, such that (1 − ε)|r ∩ P| ≤ α ≤ |r ∩ P|, and the query time is O(poly(1/ε, log n)). The result returned is correct
with high probability for all queries.
The standard lifting of points in IR² to the paraboloid in IR³ implies a similar result for approximate range counting for
disks, as a disk range query in the plane reduces to a halfspace range query in three dimensions.
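As a concrete illustration of this lifting (the function names here are illustrative): a point (x, y) maps to (x, y, x² + y²), and a disk constraint becomes a linear inequality in the lifted coordinates.

def lift(p):
    # Lift a planar point onto the paraboloid in three dimensions.
    x, y = p
    return (x, y, x * x + y * y)

def disk_as_halfspace(a, b, rad):
    # (x - a)^2 + (y - b)^2 <= rad^2 rewrites, with w = x^2 + y^2, as
    # -2a*x - 2b*y + w <= rad^2 - a^2 - b^2, a halfspace in (x, y, w)-space.
    return (-2.0 * a, -2.0 * b, 1.0), rad * rad - a * a - b * b

def point_in_disk_via_lift(p, a, b, rad):
    coeffs, offset = disk_as_halfspace(a, b, rad)
    return sum(c * q for c, q in zip(coeffs, lift(p))) <= offset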
Corollary 8.2.2 Given a set P of n points in two dimensions, and a parameter ε, one can construct a data structure in
O(n·poly(1/ε, log n)) time, using O(n·poly(1/ε, log n)) space, such that given a disk r, it outputs a number α, such that (1−ε)|r ∩ P| ≤
α ≤ |r ∩ P|, and the query time is O(poly(1/ε, log n)). The result returned is correct with high probability for all possible queries.
Depth queries. By computing the union of a set of n pseudodisks in the plane, and preprocessing the union for point-location
queries, one can perform “emptiness” queries in this case in logarithmic time. (Again, we are assuming here that we can perform
the geometric primitives on the pseudodisks in constant time.) The space needed is O(n) and it takes O(n log n) time to construct it.
Thus, we get the following result.
Corollary 8.2.3 Given a set S of n pseudodisks in the plane, one can preprocess them in O(nε⁻² log² n) time, using O(nε⁻² log n)
space, such that given a query point q, one can output a number α, such that (1 − ε)depth(q, S) ≤ α ≤ depth(q, S), and the query
time is O(ε⁻² log² n). The result returned is correct with high probability for all possible queries.
Lemma 8.3.1 (Reliable sampling.) Let S be a set of n objects, 0 < ε < 1/2, and let r be a point of depth u ≥ k in S. Let R be a
random sample of S, such that every element is picked into the sample with probability
$$p = \frac{8}{k\varepsilon^2}\ln\frac{1}{\delta}.$$
Let X be the depth of r in R. Then, with probability ≥ 1 − δ, the estimated depth of r, that is X/p, lies in the interval
[(1 − ε)u, (1 + ε)u].
In fact, this estimate succeeds with probability $\ge 1 - \delta^{u/k}$.
Proof: We have that µ = E[X] = pu. As such, by the Chernoff inequality (Theorem 25.2.6 and Theorem 25.2.9), we have
$$\Pr\Bigl[X \notin \bigl[(1-\varepsilon)\mu,\ (1+\varepsilon)\mu\bigr]\Bigr] = \Pr\bigl[X < (1-\varepsilon)\mu\bigr] + \Pr\bigl[X > (1+\varepsilon)\mu\bigr] \le \exp\bigl(-p u \varepsilon^2/2\bigr) + \exp\bigl(-p u \varepsilon^2/4\bigr) \le \exp\left(-4\ln\frac{1}{\delta}\right) + \exp\left(-2\ln\frac{1}{\delta}\right) \le \delta,$$
since u ≥ k.
Note that if the depth of r in S is (say) u ≤ 10k, then the depth of r in the sample is (with the stated probability)
$$\mathrm{depth}(r, R) \le (1 + \varepsilon)p u = O\left(\frac{1}{\varepsilon^2}\ln\frac{1}{\delta}\right),$$
which is (relatively) a small number. Namely, via sampling, we turned the task of estimating the depth of heavy ranges into the
task of estimating the depth of a shallow range. To see why this is true, observe that we can perform a binary (exponential) search
for the depth of r by a sequence of coarser to finer samples.
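A minimal sketch of this estimator (Lemma 8.3.1), assuming a primitive contains_r(o) that tests whether the object o contains the query point r; the names are illustrative.

import math
import random

def estimate_depth(objects, contains_r, k, eps, delta):
    # Sample each object independently with probability p, count the sampled
    # objects containing r, and scale back by 1/p.
    p = min(1.0, 8.0 / (k * eps ** 2) * math.log(1.0 / delta))
    X = sum(1 for o in objects if random.random() < p and contains_r(o))
    return X / p        # lies in [(1-eps)u, (1+eps)u] with probability >= 1 - delta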
8.4 Bibliographical notes
The presentation here follows the work by Aronov and Har-Peled [AH05]. The basic idea is folklore and predates this paper, but the
formal connection between approximate counting and emptiness is from this paper. One can improve the efficiency of this reduction
by being more careful; see the full version of [AH05] for details. Follow-ups to this work include [KS06, AC07, AHS07].
Chapter 9
At the sight of the still intact city, he remembered his great international precursors and set the whole place on fire
with his artillery in order that those who came after him might work off their excess energies in rebuilding.
– – The tin drum, Gunter Grass
In this chapter, we briefly describe (and analyze) a simple randomized algorithm for linear programming in low dimensions.
Next, we show how to extend this algorithm to solve linear programming with violations. Finally, we show how one can
efficiently approximate the number of constraints one needs to violate to make a linear program feasible. This serves as fruitful
ground to demonstrate some of the techniques we have visited already.
Our discussion is going to be somewhat intuitive. We will fill in the details, and formally prove the correctness of our algorithms,
in the next chapter.
9.1.1 A solution, and how to verify it
Observe that an optimal solution of a LP is either a vertex or unbounded. Indeed, if the optimal solution p lies in the middle of
a segment s, such that s is feasible, then either one of its endpoints provides a better solution (i.e., one of them is lower in the x_d
direction than p), or both endpoints of s have the same target value. But then, we can move the solution to one of the endpoints of
s. In particular, if the solution lies on a k-dimensional facet F of the boundary of the feasible polyhedron (i.e., formally, F is a set
with affine dimension k formed by intersecting the boundary of the polyhedron with a hyperplane), we can move it so that it lies on a
(k − 1)-dimensional facet F′ of the feasible polyhedron, using the preceding argumentation. Using it repeatedly, one ends up in a
vertex of the polyhedron, or in an unbounded solution.
Thus, given an instance of LP, the LP solver should output one of the following answers.
(A) Finite. The optimal solution is finite, and the solver provides a vertex which realizes the optimal solution.
(B) Unbounded. The given LP has an unbounded solution. In this case, the solver outputs a ray ζ, such that ζ lies inside
the feasible region and points downward in the negative x_d-axis direction.
(C) Infeasible. The given LP does not have any point which complies with all the given inequalities. In this case the solver
outputs d + 1 constraints which are infeasible on their own.
Lemma 9.1.1 Given a set of d linear inequalities in IRd , one can compute the vertex formed by the intersection of their boundaries
in O(d3 ) time.
Proof: Write down the system of equalities that the vertex must fulfill. It is a system of d equalities in d variables, and it can be solved
in O(d³) time using Gaussian elimination.
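A minimal sketch of this computation, assuming the d constraints are given as the rows of a matrix A with right-hand sides b (the names here are illustrative; numpy's solver carries out the Gaussian elimination).

import numpy as np

def basis_vertex(A, b):
    # A: (d, d) array of constraint normals a_i; b: (d,) right-hand sides.
    # The vertex is the unique solution of a_i . x = b_i for all i (assuming A is invertible).
    return np.linalg.solve(A, b)          # O(d^3) time

# Example: in the plane, x <= 1 and y <= 2 meet at the vertex (1, 2):
#   basis_vertex(np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([1.0, 2.0]))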
A cone is the intersection of d constraints, where its apex is the vertex associated with this set of constraints. A set of such d
constraints is a basis. An intersection of d − 1 of the hyperplanes of a basis forms a line, and clipping this line to the cone of the basis
forms a ray. Clipping the same line to the feasible region would yield either a segment, referred to as an edge of the polytope, or a
ray. An edge of the polyhedron connects two vertices of the polyhedron. As such, one can think about the boundary of the feasible
region as inducing a graph – its vertices are the vertices of the polyhedron, and its edges are the edges of the polyhedron. Since every vertex has
d hyperplanes defining it (its basis), and an adjacent edge is defined by d − 1 of these hyperplanes, it follows that each vertex has
$\binom{d}{d-1} = d$ edges adjacent to it.
The following lemma tells us when we have an optimal vertex. While it is intuitively clear, its proof requires a systematic
understanding of what the feasible region of a linear program looks like, and we delegate it to the next chapter.
Lemma 9.1.2 Let L be a given linear program, and let P denote its feasible region. Let v be a vertex of P such that all the d rays
emanating from v point upward in the x_d-axis direction. Then v is the lowest (in the x_d-axis direction) point in P and it is thus the
optimal solution to L.
Interestingly, when we are at a vertex v of the feasible region, it is easy to find the adjacent vertices. Indeed, compute the d
rays emanating from v. For each such ray, intersect it with all the constraints of the LP. The closest intersection point along this ray is
the vertex u of the feasible region adjacent to v. Doing this naively takes $O\bigl(dn + d^{O(1)}\bigr)$ time.
Lemma 9.1.2 offers a simple algorithm for computing the optimal solution for an LP. Start from a feasible vertex of the LP.
As long as this vertex has at least one ray that points downward, follow this ray to the adjacent vertex on the feasible polytope that is
lower than the current vertex (i.e., compute the d rays emanating from the current vertex, and follow one of the rays that points
downward, till you hit a new vertex). Repeat this till the current vertex has all its rays pointing upward; by Lemma 9.1.2 this is the
optimal solution. Up to tedious (and non-trivial) details, this is the simplex algorithm.
We also need the following lemma, whose proof is delegated to the next chapter.
Lemma 9.1.3 If L is a LP in d dimensions which is not feasible, then there exist d + 1 inequalities in L which are infeasible on
their own.
Note that given a set of d + 1 inequalities, it is easy to verify whether it is feasible or not. Indeed, compute the $\binom{d+1}{d} = d + 1$ vertices formed by
this set of constraints, and check whether any of these vertices is feasible. If all of them are infeasible, then this set of constraints is
infeasible.
We remind the reader that the input to the algorithm is the LP L which is defined by a set of n linear inequalities in IRd . We
are looking for the lowest point in IRd which is feasible for L.
A vertex v is acceptable if all the d rays associated with it point upward (note that the vertex itself might not be feasible).
The optimal solution (if it is finite) must be located at an acceptable vertex. Assume that we are given the basis B = {h1, . . . , h_d} of
such an acceptable vertex. Let h_{d+1}, . . . , h_m be a random permutation of the remaining constraints of the LP L.
Our algorithm is randomized incremental. At the ith step, for i ≥ d, it maintains the optimal solution for the first i
constraints. As such, in the ith step, the algorithm checks whether the optimal solution v_{i−1} of the previous iteration is still feasible
with the new constraint h_i (namely, the algorithm checks if v_{i−1} is inside the halfspace defined by h_i). If v_{i−1} is still feasible, then it is
still the optimal solution, and we set v_i ← v_{i−1}.
The more interesting case is when v_{i−1} ∉ h_i. First, we check if the basis of v_{i−1} together with h_i forms a set of constraints which
is infeasible. If so, the given LP is infeasible, and we output B(v_{i−1}) ∪ {h_i} as our proof of infeasibility.
Otherwise, the new optimal solution must lie on the hyperplane associated with h_i. As such, we recursively compute the lowest
vertex in the (d − 1)-dimensional polyhedron $(\partial h_i) \cap \bigcap_{j=1}^{i-1} h_j$. This is a linear program involving i − 1 constraints, and it involves
d − 1 variables, since it lies on the (d − 1)-dimensional hyperplane ∂h_i. The solution found, v_i, is defined by a basis of d − 1 constraints,
and adding h_i to it results in an acceptable vertex that is feasible, and we continue to the next iteration.
Clearly, the vertex vn is the required optimal solution.
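A minimal sketch of this randomized incremental algorithm, specialized to the plane (minimizing c · x subject to halfplanes a · x ≤ b). It assumes the first two halfplanes form the given acceptable basis (e.g., two bounding-box constraints whose vertex is optimal for them alone), ignores degeneracies, and uses illustrative helper names.

import random

def _vertex(h1, h2):
    # Intersection point of the boundary lines of two halfplanes a . x <= b.
    (a1, b1), (a2, b2) = h1, h2
    det = a1[0] * a2[1] - a1[1] * a2[0]
    x = (b1 * a2[1] - b2 * a1[1]) / det
    y = (a1[0] * b2 - a2[0] * b1) / det
    return (x, y)

def _solve_on_line(h, constraints, c):
    # Minimize c . x on the boundary line a . x = b of h, subject to `constraints`.
    a, b = h
    d = (-a[1], a[0])                                          # direction of the line
    p = (b / a[0], 0.0) if abs(a[0]) > abs(a[1]) else (0.0, b / a[1])  # point on it
    lo, hi = float("-inf"), float("inf")
    for ai, bi in constraints:                                 # ai . (p + t d) <= bi
        coef = ai[0] * d[0] + ai[1] * d[1]
        rhs = bi - (ai[0] * p[0] + ai[1] * p[1])
        if abs(coef) < 1e-12:
            if rhs < -1e-9:
                return None                                    # line misses the constraint
        elif coef > 0:
            hi = min(hi, rhs / coef)
        else:
            lo = max(lo, rhs / coef)
    if lo > hi:
        return None                                            # restricted LP infeasible
    t = lo if c[0] * d[0] + c[1] * d[1] > 0 else hi
    return (p[0] + t * d[0], p[1] + t * d[1])

def lp_2d(c, halfplanes):
    # halfplanes[0], halfplanes[1] are the initial acceptable basis; shuffle the rest.
    hs = halfplanes[:2] + random.sample(halfplanes[2:], len(halfplanes) - 2)
    v = _vertex(hs[0], hs[1])
    for i in range(2, len(hs)):
        a, b = hs[i]
        if a[0] * v[0] + a[1] * v[1] <= b + 1e-9:
            continue                                           # previous optimum still feasible
        v = _solve_on_line(hs[i], hs[:i], c)                   # new optimum lies on the new line
        if v is None:
            return None                                        # LP is infeasible
    return v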
®Indeed, $\frac{d}{(i-d)+d}$ lies between $\frac{d}{i-d}$ and $\frac{d}{d} = 1$.
Proof: The expected running time is
$$S(m, d) = O(md) + S(m, d - 1) + T(m, d),$$
where T(m, d) is the time to solve a LP in the restricted case of Section 9.2.1. The solution to this recurrence is $O\bigl((3d)^d m\bigr)$; see
Lemma 9.2.1.
since $1 - x \ge e^{-2x}$, for 0 < x < 1/2. If this happens, then the optimal solution for $\widehat{L}$ is v. This can be verified by computing how
many constraints of L the optimal solution of $\widehat{L}$ violates. If it violates more than k constraints, we ignore it. Otherwise, we return
this as our candidate solution.
Next, we amplify the probability of success by repeating this process $M = 8k^d \ln(1/\delta)$ times, returning the best solution found.
The probability that in all these (independent) iterations we failed to generate the optimal (violated) solution is at most
$$(1 - \alpha)^M \le \left(1 - \frac{1}{8k^d}\right)^M \le \exp\left(-\frac{M}{8k^d}\right) = \exp\left(-\ln\frac{1}{\delta}\right) = \delta.$$
Theorem 9.3.1 Let L be a linear program with n constraints over d variables, let k > 0 be a parameter, and δ > 0 a confidence
parameter. Then one can compute the optimal solution to L violating at most k constraints of L, in $O\bigl(n k^d \log(1/\delta)\bigr)$ time. The
solution returned is correct with probability ≥ 1 − δ.
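A minimal sketch of the sampling-and-verification strategy behind Theorem 9.3.1, with illustrative names: solve_lp is an exact low-dimensional LP solver (e.g., the one from Section 9.2), violates and value are assumed predicates for constraint violation and objective value, and the sampling probability 1/k is the choice consistent with the (1 − 1/k)^k ≥ e⁻² bound used above.

import math
import random

def lp_with_violations(constraints, k, d, delta, solve_lp, violates, value):
    M = math.ceil(8 * k ** d * math.log(1.0 / delta))   # number of repetitions
    best = None
    for _ in range(M):
        sample = [h for h in constraints if random.random() < 1.0 / k]
        x = solve_lp(sample)                             # optimum of the sampled LP
        if x is None:
            continue
        if sum(1 for h in constraints if violates(x, h)) > k:
            continue                                     # too many violations: ignore
        if best is None or value(x) < value(best):
            best = x                                     # keep the best candidate
    return best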
Lemma 9.4.1 Let L be a linear program with n constraints over d variables, let k > 0 and ε > 0 be parameters. Then one can
compute a solution to L violating at most (1 + ε)k constraints of L such that its value is better than the optimal solution violating
k constraints of L. The expected running time of the algorithm is
$$O\left(n + n\,\min\left(\frac{\log^{d+1} n}{\varepsilon^{2d}\, k},\ \frac{\log^{d+2} n}{\varepsilon^{2d+2}}\right)\right).$$
The algorithm succeeds with high probability.
Proof: Let $\rho = O\left(\frac{d}{k\varepsilon^2}\ln n\right)$, and pick each constraint of L into $\widehat{L}$ with probability ρ. Next, the algorithm computes the optimal
solution v in $\widehat{L}$ violating at most
$$k' = (1 + \varepsilon/3)\rho k$$
constraints, and returns this as the required solution.
I am sure the reader guessed correctly the consequences of such a despicable scenario: The universe collapses and is replaced
by a cucumber.
We need to prove the correctness of this algorithm. To this end, the reliable sampling lemma (Lemma 8.3.1) states that any
vertex v of depth u in L, for u ≥ k, has depth in the range
$$\bigl[(1 - \varepsilon/3)u\rho,\ (1 + \varepsilon/3)u\rho\bigr]$$
in $\widehat{L}$, and this holds with high probability (here we are using the fact that there are at most $n^d$ vertices defined by L).
In particular, let v_opt be the optimal solution for L of depth k. With high probability, v_opt has depth ≤ (1 + ε/3)ρk = k′ in $\widehat{L}$,
which implies that the returned solution v is better than (or as good as) v_opt, since v has depth at most k′ in $\widehat{L}$.
Next, we need to prove that v is not too deep. So, assume that v is of depth β in L. By the reliable sampling lemma, we have
that the depth of v in $\widehat{L}$ is in the range $\bigl[(1 - \varepsilon/3)\beta\rho,\ (1 + \varepsilon/3)\beta\rho\bigr]$. In particular, we know that $(1 - \varepsilon/3)\beta\rho \le k' = (1 + \varepsilon/3)\rho k$. That
is,
$$\beta \le \frac{1 + \varepsilon/3}{1 - \varepsilon/3}\, k \le (1 + \varepsilon/3)(1 + \varepsilon/2)\, k \le (1 + \varepsilon)k,$$
since 1/(1 − ε/3) ≤ 1 + ε/2 for ε ≤ 1.®
As for the running time, we are using the algorithm of Theorem 9.3.1. The input size is O(nρ) and the depth threshold is k′.
(The bound on the input size holds with high probability. We omit the easy but tedious proof of that using the Chernoff inequality.)
As such, the running time is
$$O\bigl(n + n\rho\,(k')^d \log n\bigr) = O\bigl(n + \min(n,\, n\rho)\,(\rho k)^d \log n\bigr) = O\left(n + n\,\min\left(\frac{\log^{d+1} n}{\varepsilon^{2d}\, k},\ \frac{\log^{d+2} n}{\varepsilon^{2d+2}}\right)\right).$$
Note, that the running time of Lemma 9.4.1 is linear if k is sufficiently large and ε is fixed.
If these two axioms hold, we refer to (H, w) as a LP-type problem. It is easy to verify that linear programming is a LP-type
problem.
Definition 9.5.1 A basis is a subset B ⊆ H such that w(B) > −∞ and w(B′) < w(B) for any proper subset B′ of B.
As in linear programming, we have to assume that certain basic operations can be performed quickly. These operations are:
(A) (Violation test.) For a constraint h and a basis B, test whether h is violated by B or not. Namely, test if w(B ∪ {h}) > w(B).
(B) (Basis computation.) For a constraint h and a basis B, compute the basis of B ∪ {h}.
We also need to assume that we are given an initial basis B0 from which to start our computation. The combinatorial dimension
of (H, w) is the maximum size of a basis of H. It is easy to verify that the algorithm we presented for linear programming (the special
case of Section 9.2.1) works verbatim in this setting. Indeed, start with B0, and randomly permute the remaining constraints. Now,
add the constraints in this random order, and at each step check if the new constraint violates the current solution, and if so, update the
basis of the new solution. The recursive call here corresponds to solving a subproblem where some members of the basis are fixed.
We conclude:
Theorem 9.5.2 Let (H, w) be a LP-type problem with n constraints and combinatorial dimension d. Assuming that the basic
operations take constant time, (H, w) can be solved using $d^{O(d)} n$ basic operations.
®Indeed, $(1 - \varepsilon/3)(1 + \varepsilon/2) = 1 - \varepsilon/3 + \varepsilon/2 - \varepsilon^2/6 = 1 + \varepsilon(1 - \varepsilon)/6 \ge 1$, for ε ≤ 1.
9.5.1 Examples for LP-type problems
Smallest enclosing ball. Let P be a set of n points in IR^d, and let r(P) denote the radius of the smallest ball enclosing P.
Under general position assumptions, there are at most d + 1 points on the boundary of this smallest enclosing ball. We claim that
the problem is an LP-type problem. Indeed, the basis in this case is the set of points determining the smallest enclosing ball. The
combinatorial dimension is thus d + 1. The monotonicity property holds trivially. As for the locality property, assume that we have
a set Q ⊆ P such that r(Q) = r(P). As such, P and Q have the same enclosing ball. Now, if we add a point p to Q and the radius of
its minimum enclosing ball increases, then the ball enclosing P must also change (and get bigger) when we insert p to P. Thus, this
is a LP-type problem, and it can be solved in linear time.
Theorem 9.5.3 Given a set P of n points in IRd , one can compute its smallest enclosing ball in (expected) linear time.
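A minimal sketch of the randomized incremental scheme, specialized to the plane (a Welzl-style computation of the smallest enclosing disk); general position and at least two input points are assumed, and the helper names are illustrative.

import random

def _disk2(p, q):
    # Smallest disk with p and q on its boundary (diametral disk).
    c = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
    r = ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 / 2.0
    return (c[0], c[1], r)

def _disk3(p, q, s):
    # Circumscribed disk of three (non-collinear) points.
    (ax, ay), (bx, by), (cx, cy) = p, q, s
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay) + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx) + (cx**2 + cy**2) * (bx - ax)) / d
    return (ux, uy, ((ux - ax) ** 2 + (uy - ay) ** 2) ** 0.5)

def _inside(disk, p, eps=1e-9):
    return (p[0] - disk[0]) ** 2 + (p[1] - disk[1]) ** 2 <= (disk[2] + eps) ** 2

def min_enclosing_disk(points):
    pts = list(points)
    random.shuffle(pts)                       # random insertion order
    D = _disk2(pts[0], pts[1])
    for i in range(2, len(pts)):              # invariant: D encloses pts[0..i-1]
        if _inside(D, pts[i]):
            continue
        D = _disk2(pts[0], pts[i])            # pts[i] must lie on the new boundary
        for j in range(1, i):
            if _inside(D, pts[j]):
                continue
            D = _disk2(pts[j], pts[i])        # pts[j] is also on the boundary
            for l in range(j):
                if not _inside(D, pts[l]):
                    D = _disk3(pts[l], pts[j], pts[i])
    return D                                   # (center_x, center_y, radius)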
Finding the time of first intersection. Let C(t) be a parameterized convex shape in IR^d, such that C(0) is empty, and C(t) ⊂ C(t′) if t < t′.
We are given n such shapes C1, . . . , C_n, and we would like to determine the minimal t for which they all have a common
intersection. Assume that, given a point p and such a shape C, we can decide (in constant time) the minimum t for which p ∈ C(t).
Similarly, given (say) d + 1 of these shapes, we can decide in constant time the minimum t for which they intersect, and this common
point of intersection. We would like to find the minimum t for which they all intersect. Let us also assume that these shapes are well
behaved in the sense that, for any t, we have $\lim_{\Delta \to 0} \mathrm{Vol}\bigl(C(t + \Delta) \setminus C(t)\bigr) = 0$ (namely, such a shape can not “jump” – it grows
continuously). It is easy to verify that this is a LP-type problem, and as such it can be solved in linear time.
Note that this problem is an extension of the previous problem. Indeed, if we grow a ball of radius t around each point of P,
then the problem of determining the minimal t at which all these growing balls have a non-empty intersection is equivalent to finding the
minimum radius ball enclosing all the points.
Linear programming in low dimensions. The first to realize that linear programming can be solved in linear time
in low dimensions was Megiddo [Meg83, Meg84]. His algorithm was deterministic but considerably more complicated than
the randomized algorithm we present. Clarkson [Cla95] showed how to use randomization to get a simple algorithm for linear
programming with running time O(d²n + noise), where the noise is a constant exponential in d. Our presentation follows the paper
by Seidel [Sei91]. Surprisingly, one can achieve a running time with the noise being subexponential in d. This follows by plugging
the subexponential algorithms of Kalai [Kal92] or Matoušek et al. [MSW96] into Clarkson’s algorithm [Cla95]. The resulting
algorithm has expected running time $O\bigl(d^2 n + \exp\bigl(c\sqrt{d \log d}\,\bigr)\bigr)$, for some constant c. See the survey by Goldwasser [Gol95] for
more details.
More information on Clarkson’s algorithm. Clarkson’s algorithm contains some interesting new ideas that are worth mentioning
shortly. (Matoušek et al. [MSW96] algorithm is somewhat similar to the algorithm we presented.)
Observe that if the solution for a random sample R is violated by a set X of constraints, then X must contain (at least)
one constraint which is in the basis of the optimal solution. Thus, by picking R to be of size (roughly) √n, we know that it is
a (1/√n)-net, and there would be at most √n constraints violating the solution of R. Thus, repeating this d times, at each stage
solving the problem on the collected constraints from the previous iterations, together with the current random sample, results in a set
of O(d√n) constraints that contains the optimal basis. Now solve recursively the linear program on this (greatly reduced) set of
constraints. Namely, we spent O(d²n) time (d times checking if the n constraints violate a given solution), called recursively d
times on “small” subproblems of size (roughly) O(√n), resulting in a fast algorithm.
An alternative algorithm uses the same observation, together with the reweighting technique. Here each constraint is sampled
according to its weight (which is initially 1). By doubling the weight of the violated constraints, one can argue that after a small
number of iterations, the sample would contain the required basis, while being small. See Chapter 17 for more details.
Clarkson’s algorithm works by combining these two algorithms together.
Linear programming with violations. The algorithm of Section 9.3 seems to be new, although it is implicit in the work
of Matoušek [Mat95b], which presents a slightly faster deterministic algorithm. The first paper on this problem (in two dimensions)
is due to Everett et al. [ERvK96]. This was extended by Matoušek to higher dimensions [Mat95b]. His algorithm relies on the
idea of computing all O(k^d) local maxima in the “k-level” explicitly, by traveling between them. This is done by solving linear
programming instances which are “similar”. As such, these results can be further improved using techniques for dynamic linear
programming that allow insertions and deletions of constraints; see the work by Chan [Cha96]. Chan [Cha05] showed how to
further improve these algorithms in dimensions 2, 3 and 4, although these improvements disappear if k is close to linear.
The idea of approximate linear programming with violations is due to Aronov and Har-Peled [AH05], and our presentation
follows their results. Using more advanced data-structures these results can be further improved (as far as the polylog noise is
concerned), see the work by Afshani and Chan [AC07].
LP-type problems. The notion of LP-type algorithms is mentioned in the work of Sharir and Welzl [SW92]. They also showed
that deciding if a set of (axis parallel) rectangles can be pierced by 3 points is a LP-type problem (quite surprising, as the problem
has no convex programming flavor). Our example of computing the first intersection of growing convex sets is motivated by the work
of Amenta [Ame94] on the connection between LP-type problems and Helly-type theorems.
Intuitively, any low-dimensional convex programming problem is a natural candidate to be solved using LP-type techniques.
9.7 Exercises
Chapter 10
I don’t know why it should be, I am sure; but the sight of another man asleep in bed when I am up, maddens me. It
seems to me so shocking to see the precious hours of a man’s life - the priceless moments that will never come back
to him again - being wasted in mere brutish sleep.
– – Three men in a boat, Jerome K. Jerome
In this chapter, we formally investigate what the feasible region of a linear program looks like, and establish the correctness
of the algorithm we presented for linear programming. Linear programming seems to be one case where the geometric intuition is
quite clear, but crystallizing it into a formal proof requires quite a bit of work. In particular, we prove in this chapter more than we
strictly need, since it supports (and, may we dare suggest, with due humility and respect to the most esteemed reader, that it even
expands¯) our natural intuition.
Underlying our discussion is the dichotomy between the input to LP, which is a set of halfspaces, and the entities LP works
with, which are vertices. In particular, we need to establish that describing the feasible region of a LP in terms of the (convex hull
of its) vertices or, alternatively, as the intersection of halfspaces, is equivalent.
10.1 Preliminaries
We had already encountered Radon’s theorem, which we restate.
Theorem 10.1.1 (Radon’s Theorem.) Let P = {p1, . . . , p_{d+2}} be a set of d + 2 points in IR^d. Then, there exist two disjoint subsets
Q and R of P, such that CH(Q) ∩ CH(R) ≠ ∅.
Theorem 10.1.2 (Helly’s theorem.) Let F be a set of n convex sets in IR^d. The intersection of all the sets of F is non-empty if and
only if every d + 1 of them have a non-empty intersection.
¯Hopefully the reader is happy we are less polite to him/her in the rest of the book, since otherwise the text would be insufferably tedious.
since U(Q) ∪ U(R) = F. As such, $R \subseteq \bigcap_{Y \in U(Q)} Y$ and $Q \subseteq \bigcap_{Y \in U(R)} Y$. Now, by the convexity of the sets of F, we have
$\mathrm{CH}(R) \subseteq \bigcap_{Y \in U(Q)} Y$ and $\mathrm{CH}(Q) \subseteq \bigcap_{Y \in U(R)} Y$. Namely, we have
$$r \in \mathrm{CH}(R) \cap \mathrm{CH}(Q) \subseteq \left(\bigcap_{Y \in U(Q)} Y\right) \cap \left(\bigcap_{Y \in U(R)} Y\right) = \bigcap_{Y \in F} Y.$$
Theorem 10.1.3 (Carathéodory theorem.) Let X be a convex set in IRd , and let p be some point in the interior of X. Then p is a
convex combination of d + 1 points of X.
Proof: Suppose $p = \sum_{i=1}^{m} \lambda_i x_i$ is a convex combination of m > d + 1 points, where {x1, . . . , x_m} ⊆ X, λ1, . . . , λ_m > 0 and
$\sum_i \lambda_i = 1$. We will show that p can be rewritten as a convex combination of m − 1 of these points, as long as m > d + 1.
So, consider the following system of equations:
$$\sum_{i=1}^{m} \gamma_i x_i = 0 \qquad\text{and}\qquad \sum_{i=1}^{m} \gamma_i = 0. \tag{10.1}$$
It has m > d + 1 variables (i.e., γ1, . . . , γ_m) but only d + 1 equations. As such, there is a non-trivial solution to this system of
equations; denote it by $\widehat{\gamma}_1, \ldots, \widehat{\gamma}_m$. Since $\widehat{\gamma}_1 + \cdots + \widehat{\gamma}_m = 0$, some of the $\widehat{\gamma}_i$’s are strictly positive, and some of them are strictly
negative. Let
$$\tau = \min_{j,\ \widehat{\gamma}_j > 0} \frac{\lambda_j}{\widehat{\gamma}_j} > 0.$$
Assume, without loss of generality, that $\tau = \lambda_1/\widehat{\gamma}_1$. Let
$$\widetilde{\lambda}_i = \lambda_i - \tau\widehat{\gamma}_i,$$
for i = 1, . . . , m. Then $\widetilde{\lambda}_1 = \lambda_1 - (\lambda_1/\widehat{\gamma}_1)\widehat{\gamma}_1 = 0$. Furthermore, if $\widehat{\gamma}_i < 0$, then $\widetilde{\lambda}_i = \lambda_i - \tau\widehat{\gamma}_i \ge \lambda_i > 0$. Otherwise, if $\widehat{\gamma}_i > 0$, then
$$\widetilde{\lambda}_i = \lambda_i - \left(\min_{j,\ \widehat{\gamma}_j > 0} \frac{\lambda_j}{\widehat{\gamma}_j}\right)\widehat{\gamma}_i \ge \lambda_i - \frac{\lambda_i}{\widehat{\gamma}_i}\,\widehat{\gamma}_i \ge 0.$$
So, $\widetilde{\lambda}_1 = 0$ and $\widetilde{\lambda}_2, \ldots, \widetilde{\lambda}_m \ge 0$. Furthermore,
$$\sum_{i=1}^{m} \widetilde{\lambda}_i = \sum_{i=1}^{m} \bigl(\lambda_i - \tau\widehat{\gamma}_i\bigr) = \sum_{i=1}^{m} \lambda_i - \tau \sum_{i=1}^{m} \widehat{\gamma}_i = \sum_{i=1}^{m} \lambda_i = 1,$$
since $\widehat{\gamma}_1, \ldots, \widehat{\gamma}_m$ is a solution to Eq. (10.1). As such, $q = \sum_{i=2}^{m} \widetilde{\lambda}_i x_i$ is a convex combination of m − 1 points of X. Furthermore, as
$\widetilde{\lambda}_1 = 0$, we have
$$q = \sum_{i=2}^{m} \widetilde{\lambda}_i x_i = \sum_{i=1}^{m} \widetilde{\lambda}_i x_i = \sum_{i=1}^{m} \bigl(\lambda_i - \tau\widehat{\gamma}_i\bigr) x_i = \sum_{i=1}^{m} \lambda_i x_i - \tau \sum_{i=1}^{m} \widehat{\gamma}_i x_i = \sum_{i=1}^{m} \lambda_i x_i - \tau \cdot 0 = \sum_{i=1}^{m} \lambda_i x_i = p,$$
since (again) $\widehat{\gamma}_1, \ldots, \widehat{\gamma}_m$ is a solution to Eq. (10.1). As such, we found a representation of p as a convex combination of m − 1 points,
and we can continue this process till m = d + 1, establishing the theorem.
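A minimal sketch of one such reduction step (illustrative code only; the helper name and the use of an SVD to find a non-trivial solution of Eq. (10.1) are implementation choices, not part of the text).

import numpy as np

def caratheodory_step(X, lam, eps=1e-12):
    # X: (m, d) array of points, lam: (m,) positive weights summing to 1, with m > d + 1.
    # Returns new nonnegative weights describing the same point, with one weight set to zero.
    m, d = X.shape
    assert m > d + 1
    A = np.vstack([X.T, np.ones(m)])              # Eq. (10.1): a (d+1) x m system, so a
    gamma = np.linalg.svd(A)[2][-1]               # non-trivial null vector exists
    if gamma.max() <= eps:                        # make sure some entries are positive
        gamma = -gamma
    pos = np.where(gamma > eps)[0]
    j = pos[np.argmin(lam[pos] / gamma[pos])]     # index attaining tau
    tau = lam[j] / gamma[j]
    new_lam = lam - tau * gamma                   # stays nonnegative and sums to 1
    new_lam[j] = 0.0                              # exactly zero by the choice of tau
    return new_lam

# Usage: repeat while more than d + 1 weights are nonzero; the represented point
# lam @ X is unchanged by each step.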
Fourier-Motzkin elimination. Let L = (M, b) be an instance of LP (we care only about feasibility here, so we ignore the
target function). Consider the ith variable x_i in the LP L. If it appears only with positive coefficients in all the inequalities (i.e., the
ith column of M has only positive numbers), then we can set the ith variable to be a sufficiently large negative number, and all the
inequalities where it appears would be immediately feasible. The same holds if all such coefficients are negative. Thus, consider the case
where the variable x_i appears with both negative coefficients and positive coefficients in the LP. Let us inspect two such inequalities,
say the kth and lth inequalities, and assume, for the sake of concreteness, that M_{ki} > 0 and M_{li} < 0. Clearly, we can multiply the lth
inequality by a positive number and add it to the kth inequality, so that in the resulting inequality the coefficient of x_i is zero.
Let L′ = elim(L, i) denote the resulting linear program, where we copy all inequalities of the original LP where x_i has zero as its
coefficient. In addition, we add all the inequalities that can be formed by taking “positive” and “negative” appearances of x_i in the original
LP and canceling them out as described above. Note that L′ might have m²/4 inequalities, but since x_i is now eliminated (i.e.,
all of its appearances are with coefficient zero), the LP L′ is defined over d − 1 variables.
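A minimal sketch of this elimination step for the mixed-sign case (the function name is illustrative; no attempt is made to remove redundant inequalities).

def eliminate(M, b, i):
    # M: list of coefficient rows, b: list of right-hand sides of M x <= b. Returns (M', b').
    zero, pos, neg = [], [], []
    for row, rhs in zip(M, b):
        (zero if row[i] == 0 else pos if row[i] > 0 else neg).append((row, rhs))
    new_M = [list(row) for row, _ in zero]
    new_b = [rhs for _, rhs in zero]
    # Combine every positive appearance of x_i with every negative one so that x_i cancels.
    for prow, prhs in pos:
        for nrow, nrhs in neg:
            lam = prow[i] / -nrow[i]              # positive multiplier for the "negative" row
            row = [p + lam * n for p, n in zip(prow, nrow)]
            row[i] = 0.0                           # coefficient of x_i is now zero
            new_M.append(row)
            new_b.append(prhs + lam * nrhs)
    return new_M, new_b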
Lemma 10.1.4 Let L be an instance of LP with d variables and m inequalities. The linear program L′ = elim(L, i) is feasible if and
only if L is feasible, for any i ∈ {1, . . . , d}.
Proof: One direction is easy: if L is feasible, then its solution (omitting the ith variable) is feasible for L′.
The other direction becomes clear once we understand what the elimination really does. So, consider two inequalities in L, such
that M_{ki} < 0 and M_{ji} > 0. We can rewrite these inequalities such that they become
$$a_0 + \sum_{\tau \ne i} a_\tau x_\tau \le x_i \tag{A}$$
$$\text{and}\qquad x_i \le b_0 + \sum_{\tau \ne i} b_\tau x_\tau, \tag{B}$$
respectively. The eliminated inequality described above is no more than the inequality we get by chaining these inequalities
together; that is,
$$a_0 + \sum_{\tau \ne i} a_\tau x_\tau \le x_i \le b_0 + \sum_{\tau \ne i} b_\tau x_\tau \quad\Longrightarrow\quad a_0 + \sum_{\tau \ne i} a_\tau x_\tau \le b_0 + \sum_{\tau \ne i} b_\tau x_\tau.$$
In particular, for a feasible solution to L′, all the left sides of inequalities of type (A) must be smaller than (or equal to) the right sides of
all the inequalities of type (B), since we combined all such pairs of inequalities into an inequality in L′.
In particular, given a feasible solution sol to L′, one can extend it into a solution of L by computing a value of x_i such that all the
original inequalities hold. Indeed, every pair of inequalities as above of L, when we substitute the values of x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_d
of sol into L, results in an interval I, such that x_i must lie inside I. Each inequality of type (A) induces a left endpoint of such
an interval, and each inequality of type (B) induces a right endpoint of such an interval. We created all possible intervals of this
type (using these endpoints) when creating L′, and as such, for sol, all these intervals must be non-empty. We conclude that the
intersection of all these intervals is non-empty, implying that one can pick a value for x_i such that L is feasible.
Given a H-polyhedron P, the elimination of the ith variable, elim(P, i), can be interpreted as projecting the polyhedron P onto
the hyperplane x_i = 0. Furthermore, the projected polyhedron is still a H-polyhedron. By a change of variables, this implies that any
projection of a H-polyhedron onto a hyperplane is a H-polyhedron. We can repeat this process of projecting down the polyhedron
several times, which implies the following lemma.
Lemma 10.1.5 The projection of a H-polyhedron into any affine subspace is a H-polyhedron.
Lemma 10.1.6 (Farkas Lemma) Let M ∈ IR^{m×d} and b ∈ IR^m specify a LP. Then either:
(i) There exists a feasible solution x ∈ IR^d to the LP; that is, Mx ≤ b.
(ii) Or, there is no feasible solution, and we can prove it by combining the inequalities of the LP into an infeasible inequality.
Formally, there exists a vector w ∈ IR^m, such that w ≥ 0, wM = 0 and w · b < 0. Namely, the inequality (that must hold if the
LP is feasible)
$$(wM)\, x \le w \cdot b$$
is infeasible.
Proof: Clearly, the two options can not hold together, so all we need to show is that if an LP is infeasible then the second
option holds.
Observe that if we apply a sequence of eliminations to an LP, all the resulting inequalities in the new LP are positive combinations
of the original inequalities. So, let us apply this process of elimination to each one of the variables in turn. If the original LP
is not feasible, then sooner or later this elimination process would get stuck, as it will generate an infeasible inequality. Such an
inequality would have all the variables with zero coefficients, and the constant would be negative. This establishes the claim.
Note that the above elimination process provides us with a naive algorithm for solving LP. This algorithm is extremely
inefficient, as in the last elimination stage we might end up with $m^{2^d}$ inequalities. This would lead to an algorithm which is
unacceptably slow for solving LP.
We remind the reader that a linear equality of the form $\sum_i a_i x_i = c$ can be rewritten as the two inequalities $\sum_i a_i x_i \le c$ and
$\sum_i a_i x_i \ge c$. The great success of the Farkas Lemma in the marketplace has led to several sequels to it, and here is one of them.
Lemma 10.1.7 (Farkas Lemma II) Let M ∈ IR^{m×d} and b ∈ IR^m, and consider the following linear program: Mx = b, x ≥ 0. Then,
either (i) this linear program is feasible, or (ii) there is a vector w ∈ IR^m such that wM ≥ 0 and w · b < 0.
Proof: If (ii) holds, then consider a feasible solution x to the LP. We have that M x = b, multiplying this equality from the left
by w, we have that
(wM)x = w · b.
But (ii) claims that the quantity on the left is non-negative (since x ≥ 0), and the quantity on the right is negative. A contradiction.
As such these two options are mutually exclusive.
The linear program Mx = b, x ≥ 0 can be rewritten as Mx ≤ b, Mx ≥ b, x ≥ 0, which in turn is equivalent to the LP:
$$M x \le b, \quad -M x \le -b, \quad -x \le 0 \qquad\Longleftrightarrow\qquad \begin{pmatrix} M \\ -M \\ -I_d \end{pmatrix} x \le \begin{pmatrix} b \\ -b \\ 0 \end{pmatrix},$$
where I_d is the d × d identity matrix. Now, if the original LP does not have a feasible solution, then by the original Farkas lemma
(Lemma 10.1.6), there must be a vector (w1, w2, w3) ≥ 0 such that
$$(w_1, w_2, w_3) \begin{pmatrix} M \\ -M \\ -I_d \end{pmatrix} = 0 \qquad\text{and}\qquad (w_1, w_2, w_3) \cdot \begin{pmatrix} b \\ -b \\ 0 \end{pmatrix} < 0.$$
Namely, (w1 − w2)M − w3 = 0 and (w1 − w2) · b < 0. But w3 ≥ 0, which implies that
$$(w_1 - w_2)\, M = w_3 \ge 0.$$
Namely, the claim holds for w = w1 − w2.
Cones and vertices. For a set of vectors V ⊆ IR^d, let cone(V) denote the cone they generate. Formally,
$$\mathrm{cone}(V) = \left\{ \sum_i t_i \vec{v}_i \ \middle|\ \vec{v}_i \in V,\ t_i \ge 0 \right\}.$$
A halfspace passes through a point p, if p is contained in the hyperplane bounding this halfspace. Since 0 is the apex of cone(V),
it is natural to presume that cone(V) can be generated by a finite intersection of halfspaces, all of them passing through the origin
(which is indeed true).
In the following, let e(i, d) denote the ith orthonormal vector in IR^d; namely,
$$e(i, d) = \bigl(\,\underbrace{0, \ldots, 0}_{i-1 \text{ coords}},\ 1,\ \underbrace{0, \ldots, 0}_{d-i \text{ coords}}\,\bigr).$$
Lemma 10.1.8 Let M ∈ IRm×d be a given matrix, and consider the H-polyhedron P formed by all points (x, w) ∈ IRd+m , such that
M x ≤ w. Then P = cone(V), where V is a finite set of vectors in IRd+m .
Proof: Let E_i = e(i, d + m), for i = d + 1, . . . , d + m. Clearly, the inequality Mx ≤ w trivially holds if (x, w) = E_i, for
i = d + 1, . . . , d + m (since x = 0 in such a case). Also, let
$$\vec{v}_i = \bigl(e(i, d),\ M\, e(i, d)\bigr),$$
for i = 1, . . . , d. Clearly, the inequality holds for $\vec{v}_i$, since trivially $M\, e(i, d) \le M\, e(i, d)$, for i = 1, . . . , d. Let $V = \bigl\{\vec{v}_1, \ldots, \vec{v}_d, E_{d+1}, \ldots, E_{d+m}\bigr\}$.
We claim that cone(V) = P. One direction is easy: as the inequality holds for all the vectors in V, it also holds for any
positive combination of these vectors, implying that cone(V) ⊆ P. The other direction is slightly more challenging. Consider an
(x, w) ∈ IR^{d+m} such that Mx ≤ w. Clearly, w − Mx ≥ 0. As such, (x, w) = (x, Mx) + (0, w − Mx). Now, $(x, Mx) \in \mathrm{cone}\bigl(\bigl\{\vec{v}_1, \ldots, \vec{v}_d\bigr\}\bigr)$,
and since w − Mx ≥ 0, we have $(0, w - Mx) \in \mathrm{cone}(E_{d+1}, \ldots, E_{d+m})$. Thus, $P \subseteq \mathrm{cone}\bigl(\bigl\{\vec{v}_1, \ldots, \vec{v}_d\bigr\}\bigr) + \mathrm{cone}(E_{d+1}, \ldots, E_{d+m}) = \mathrm{cone}(V)$.
We need the following simple pairing lemma, which we leave as an exercise for the reader (see Exercise 10.5.1).
As in most sequels, Farkas Lemma II is equivalent to the original Farkas Lemma. I only hope the reader does not feel
cheated.
Lemma 10.1.9 Let α1, . . . , α_n, β1, . . . , β_m be positive numbers, such that $\sum_{i=1}^{n} \alpha_i = \sum_{j=1}^{m} \beta_j$, and consider the positive combination
$\vec{w} = \sum_{i=1}^{n} \alpha_i \vec{v}_i + \sum_{j=1}^{m} \beta_j \vec{u}_j$, where $\vec{v}_1, \ldots, \vec{v}_n, \vec{u}_1, \ldots, \vec{u}_m$ are vectors (say, in IR^d). Then, there are non-negative $\delta_{i,j}$’s, such that
$\vec{w} = \sum_{i,j} \delta_{i,j} \bigl(\vec{v}_i + \vec{u}_j\bigr)$.
Lemma 10.1.10 Let C = cone(V) be a cone generated by a set of vectors V in IR^d. Consider the region P = C ∩ h, where h is a
hyperplane that passes through the origin. Then, P is a cone; namely, there exists a set of vectors V′ such that P = cone(V′).
Proof: By a rigid rotation of the axis system, we can assume that h ≡ (x1 = 0). Furthermore, by scaling, we can assume that the
first coordinate of all points in V is either −1, 0 or 1 (clearly, scaling the vectors generating the cone does not affect the cone itself).
Let V = V_{−1} ∪ V_0 ∪ V_1, where V_0 is the set of vectors in V with the first coordinate being zero, and V_{−1} (resp. V_1) are the vectors
in V with the first coordinate being −1 (resp. 1).
Let $V_{-1} = \bigl\{\vec{v}_1, \ldots, \vec{v}_n\bigr\}$, $V_1 = \bigl\{\vec{u}_1, \ldots, \vec{u}_m\bigr\}$, and let
$$V' = V_0 \cup \Bigl\{ \vec{v}_i + \vec{u}_j \;\Bigm|\; i = 1, \ldots, n,\ j = 1, \ldots, m \Bigr\}.$$
Clearly, all the vectors of V′ have zero in their first coordinate, and as such V′ ⊆ h, implying that cone(V′) ⊆ C ∩ h.
As for the other direction, consider a vector $\vec{w} \in C \cap h$. It can be rewritten as $\vec{w} = \vec{w}_0 + \sum_i \alpha_i \vec{v}_i + \sum_j \beta_j \vec{u}_j$, where $\vec{w}_0 \in \mathrm{cone}(V_0)$.
Since the first coordinate of $\vec{w}$ is zero, we must have that $\sum_i \alpha_i = \sum_j \beta_j$. Now, by the above pairing lemma (Lemma 10.1.9), there
are (non-negative) $\delta_{i,j}$’s such that
$$\vec{w} = \vec{w}_0 + \sum_{i,j} \delta_{i,j} \bigl(\vec{v}_i + \vec{u}_j\bigr) \in \mathrm{cone}(V'),$$
implying that C ∩ h ⊆ cone(V′).
We next handle the general question of what the intersection of a general cone with a hyperplane looks like.
Lemma 10.1.11 Let C = cone(V) be a cone generated by a set of vectors V in IR^d. Consider the region P = C ∩ h, where h is any
hyperplane in IR^d. Then, there exist sets of vectors U and W such that P = CH(U) + cone(W).
Proof: As before, by a change of variables, we can assume that h ≡ (x1 = 1). As before, we can normalize the vectors of V so that
the first coordinate is either 0, −1 or 1. Next, as before, break V into three disjoint sets, such that V = V_{−1} ∪ V_0 ∪ V_1, where V_i are
the vectors in V with the first coordinate being i, for i ∈ {−1, 0, 1}.
By Lemma 10.1.10, there exists a set of vectors W_0 such that cone(W_0) = C ∩ (x1 = 0). Also, let X = cone(V_1) ∩ (x1 = 1). A
point p ∈ X is a positive combination $p = \sum_i t_i \vec{v}_i$, where $\vec{v}_i \in V_1$ and $t_i \ge 0$ for all i. But the first coordinate of all the points
of V_1 is 1, and so is the first coordinate of p. Namely, it must be that $\sum_i t_i = 1$, implying that X = CH(V_1).
We claim that P = cone(V) ∩ (x1 = 1) is equal to Y = CH(V_1) + cone(W_0). One direction is easy: as V_1, W_0 ⊆ cone(V), it
follows that Y ⊆ cone(V). Now, a point of Y is the sum of two vectors, one of which has 1 in the first coordinate, and the other has
zero there. As such, a point of Y lies on the hyperplane x1 = 1, implying Y ⊆ P.
As for the other direction, consider a point p ∈ P. It can be written as
$$p = \sum_{i,\ \vec{v}_i \in V_1} \alpha_i \vec{v}_i + \sum_{j,\ \vec{u}_j \in V_0} \beta_j \vec{u}_j + \sum_{k,\ \vec{w}_k \in V_{-1}} \gamma_k \vec{w}_k,$$
where $\alpha_i, \beta_j, \gamma_k \ge 0$, for all i, j, k. Now, by considering only the first coordinate of this sum of vectors, we have
$$\sum_i \alpha_i - \sum_k \gamma_k = 1.$$
In particular, the $\alpha_i$’s can be rewritten as $\alpha_i = a_i + b_i$, where $a_i, b_i \ge 0$, $\sum_i b_i = \sum_k \gamma_k$ and $\sum_i a_i = 1$. As such,
$$p = \sum_{i,\ \vec{v}_i \in V_1} a_i \vec{v}_i + \sum_{j,\ \vec{u}_j \in V_0} \beta_j \vec{u}_j + \sum_{i,\ \vec{v}_i \in V_1} b_i \vec{v}_i + \sum_{k,\ \vec{w}_k \in V_{-1}} \gamma_k \vec{w}_k.$$
Now, since $\sum_i b_i = \sum_k \gamma_k$, we have $\sum_{i,\ \vec{v}_i \in V_1} b_i \vec{v}_i + \sum_{k,\ \vec{w}_k \in V_{-1}} \gamma_k \vec{w}_k \in (x_1 = 0)$. As such, we have
$$\sum_{j,\ \vec{u}_j \in V_0} \beta_j \vec{u}_j + \sum_{i,\ \vec{v}_i \in V_1} b_i \vec{v}_i + \sum_{k,\ \vec{w}_k \in V_{-1}} \gamma_k \vec{w}_k \in \mathrm{cone}(W_0).$$
Also, $\sum_i a_i = 1$ and, for all i, $a_i \ge 0$, implying that $\sum_{i,\ \vec{v}_i \in V_1} a_i \vec{v}_i \in \mathrm{CH}(V_1)$. Thus p is a sum of two points, one of them in cone(W_0)
and the other in CH(V_1), implying that p ∈ Y and thus P ⊆ Y. We conclude that P = Y.
Theorem 10.1.12 A cone C is generated by a finite set V ⊆ IRd (that is C = cone(V)), if and only if, there exists a finite set of
halfspaces, all passing through the origin, such that their intersection is C.
In linear programming lingo, a cone C is finitely generated by V, if and only if there exists a matrix M ∈ IRm×d , such that x ∈ C
if and only if M x ≤ 0.
Proof: Let C = cone(V), and observe that a point p ∈ cone(V) can be written as (part of) a solution to a linear program. Indeed,
let $V = \bigl\{\vec{v}_1, \ldots, \vec{v}_m\bigr\}$, and consider the linear program, in the variables $t_1, \ldots, t_m$ and $x \in \mathrm{IR}^d$:
$$\sum_{i=1}^{m} t_i \vec{v}_i = x \qquad\text{and}\qquad t_i \ge 0 \ \text{ for all } i.$$
Clearly, for any x ∈ C, there are $t_1, \ldots, t_m$ such that $\sum_i t_i \vec{v}_i = x$. Thus, the projection of (the feasible region of) this LP, which has m + d variables, onto the subspace
formed by the coordinates of the d variables x = (x1, . . . , x_d), is the set C. Now, by Lemma 10.1.5, this projection is also a
H-polyhedron, implying that C is a H-polyhedron.
As for the other direction, consider the H-polyhedron P formed by the set of points (x, w) ∈ IR^{d+m} such that Mx ≤ w. By
Lemma 10.1.8, there exists V ⊆ IR^{d+m} such that cone(V) = P. Now,
$$\bigl\{ x \in \mathrm{IR}^d \;\bigm|\; Mx \le 0 \bigr\} \times \{0\} = P \cap (w_1 = 0) \cap \cdots \cap (w_m = 0).$$
A repeated application of Lemma 10.1.10 implies that the above set is a cone generated by a (finite) set of vectors, since w_i = 0 is
a hyperplane, for i = 1, . . . , m.
Theorem 10.1.13 A region P is a H-polyhedron in IRd , if and only if, there exist finite sets P, V ⊆ IRd such that P = CH(P) +
cone(V).
Proof: Consider the linear inequality $\sum_{i=1}^{d} a_i x_i \le b$, which is one of the constraints defining the polyhedron P. We can lift
this inequality into an inequality passing through the origin by introducing an extra variable. The resulting inequality in IR^{d+1} is
$\sum_{i=1}^{d} a_i x_i - b x_{d+1} \le 0$. Clearly, this inequality defines a halfspace that goes through the origin, and furthermore, its intersection with
x_{d+1} = 1 is exactly (P, 1) (i.e., the set of points of the polyhedron P concatenated with 1 as the last coordinate). Thus, by doing this
“lifting” for all the linear constraints defining P, we get a cone C with its apex at the origin, defined by the intersection of halfspaces
passing through the origin. As such, by Theorem 10.1.12, there exists a set of vectors V ⊆ IR^{d+1}, such that C = cone(V) and
C ∩ (x_{d+1} = 1) = (P, 1). Now, Lemma 10.1.11 implies that there exist P, W ⊆ IR^{d+1}, such that C ∩ (x_{d+1} = 1) = CH(P) + cone(W).
Dropping the (d + 1)st coordinate from these points implies that P = CH(P′) + cone(W′), where P′ and W′ are P and W projected onto
the first d coordinates, respectively.
As for the other direction, assume that P = CH(P) + cone(V). Let P′ = (P, 1) and V′ = (V, 0) be the liftings of P and V to d + 1
dimensions, padding each point of P with an extra coordinate equal to 1 and each vector of V with an extra coordinate equal to 0. Now, clearly,
$$(P, 1) \subseteq \mathrm{cone}(P' \cup V') \cap (x_{d+1} = 1).$$
In fact, the containment holds also in the other direction, since a point $p \in \mathrm{cone}(P' \cup V') \cap (x_{d+1} = 1)$ is made out of a convex
combination of the points of P′ (since they have 1 in the (d + 1)th coordinate) and a positive combination of the points of V′ (that
have 0 in the (d + 1)th coordinate). Thus, we have that (P, 1) = cone(P′ ∪ V′) ∩ (x_{d+1} = 1). The cone C′ = cone(P′ ∪ V′) can be
described as a finite intersection of halfspaces (all passing through the origin), by Theorem 10.1.12. Let L be the equivalent linear
program. Now, substitute x_{d+1} = 1 into this linear program. This results in a linear program over IR^d whose feasible region is P.
Namely, P is a H-polyhedron.
A polytope is the convex hull of a finite point set. Theorem 10.1.13 implies that a polytope is also formed by the intersection
of a finite set of halfspaces. Namely, a polytope is a bounded polyhedron.
A linear inequality a · x ≤ b is valid for a polytope P if it holds for all x ∈ P . A face of P is a set
F = P ∩(a · x = b) ,
where a · x ≤ b is a valid inequality for P . The dimension of F is the affine dimension of the affine space it spans. As such, a vertex
is 0 dimensional. Intuitively, vertices are the “corners” of the polytopes.
Lemma 10.1.14 A vertex p of a polyhedron P cannot be written as a convex combination of a set X of points, such that X ⊆ P and p ∉ X.
Proof: Let h be a hyperplane whose intersection with P is (only) the point p. All the points of X must lie strictly on one side of h, since no point of X lies on h itself. As such, any convex combination of the points of X lies strictly on one side of h, and in particular cannot equal p.
Claim 10.1.15 Let P be a polytope in IRd. Then P is the convex hull of its vertices; namely, P = CH(vert(P)). Furthermore, if for a set V ⊆ IRd we have P = CH(V), then vert(P) ⊆ V.
Proof: The polytope P is a bounded intersection of halfspaces. By Theorem 10.1.13, it can be represented as the sum of the convex hull of a finite point set with a cone. But the cone here is empty, since P is bounded. Thus, there exists a finite set V such that P = CH(V). Let p be a point of V. If p can be expressed as a convex combination of the points of V \ {p}, then CH(V \ {p}) = CH(V), since any point expressed as a convex combination involving p can be rewritten to exclude p. So, let X be the resulting set after we remove all such superfluous vertices. We claim that X = vert(P).
Indeed, if p ∈ X, then the following LP does not have a solution: ∑_i t_i = 1 and ∑_{i=1}^{m} t_i p_i = p, where X \ {p} = {p_1, . . . , p_m}. In matrix form, this LP is

M t = (1, p)    and    t = (t_1, . . . , t_m) ≥ 0,

where M is the (d + 1) × m matrix whose ith column is (1, p_i).
By the Farkas Lemma II (Lemma 10.1.7), since this LP is not feasible, there exists a vector w ∈ IRd+1 such that wM ≥ 0 and w · (1, p) < 0. Writing w = (α, s), we can restate these inequalities as

α + s · p_i ≥ 0, for i = 1, . . . , m,    and    α + s · p < 0.

Namely, all the points of X \ {p} lie in the halfspace s · x ≥ −α, while p lies strictly outside it. Hence the minimum of s · x over P = CH(X) is attained only at p, and p is a vertex of P.
Lemma 10.1.16 Let v be a vertex of a polytope P, let P/v = P ∩ h be its vertex figure, where h ≡ (w · x = c_1) is a hyperplane separating v from the other vertices of P, and let cone_{v,P} = v + cone(P/v − v). Then P ⊆ cone_{v,P}.
Proof: Consider a point p ∈ P. We can assume that, for the hyperplane h ≡ (w · x = c_1) defining P/v, it holds that w · v < c_1 ≤ w · p (otherwise, we can translate h so that this holds). In particular, consider the point q = vp ∩ (P/v). By the convexity of P, we have q ∈ P, and as such q ∈ P/v. Thus, q − v ∈ cone(P/v − v), and thus p − v ∈ cone(P/v − v), which implies that p ∈ v + cone(P/v − v) = cone_{v,P}.
Lemma 10.1.17 Let h be a halfspace defined by w · x ≤ c_3, such that for a vertex v of P we have w · v = c_3, and h is valid for P/v (i.e., for all x ∈ P/v we have w · x ≤ c_3). Then h is valid for P.
Proof: Consider the linear function f(x) = c_3 − w · x. It is zero at v and non-negative on P/v. As such, it is non-negative on any ray starting at v and passing through a point of P/v. As such, f(·) is non-negative on any point of cone_{v,P}, which implies, by Lemma 10.1.16, that f(·) is non-negative on P.
Lemma 10.1.18 There is a bijection between the k-dimensional faces of P that contain v and the (k − 1)-dimensional faces of P /v.
Specifically, for a face f of P , the corresponding face is
π(f) = f ∩ h,
where h ≡ w · x = c1 is the hyperplane of P /v. Similarly, for a (k − 1)-dimensional face g of P /v, the corresponding face of P is
σ(g) = affine(v, g) ∩ P ,
where affine(v, g) is the affine subspace spanned by v and the points of g.
Proof: For the sake of simplicity of exposition, assume that h ≡ (x_d = 0) and furthermore v[d] > 0, where v[d] denotes the dth coordinate of v. This can always be realized by a rigid rotation and translation of space.
A face f of P with v ∈ f is defined as P ∩ (w_2 · x = c_2), where w_2 · x ≤ c_2 is a valid inequality for P. Now,

π(f) = f ∩ h = P ∩ (w_2 · x = c_2) ∩ h = (P ∩ h) ∩ (w_2 · x = c_2) = (P/v) ∩ (w_2 · x = c_2),

where w_2 · x ≤ c_2 is a valid inequality for the polytope P/v ⊆ P. As such, π(f) is a face of P/v. Note that if f is k-dimensional and k ≥ 1, then it contains two vertices of P which are on different sides of h, and as such π(f) is not empty.
Let g be a face of P/v, defined as g = (P/v) ∩ (w_3 · x = c_3), where w_3 · x ≤ c_3 is an inequality valid for P/v. Note that by setting w_3[d] to be a sufficiently large negative number, we can guarantee that w_3 · v < c_3.
For λ ≥ 0, consider the convex combination of the two inequalities w_3 · x ≤ c_3 and x_d ≤ 0; that is,

h(λ) ≡ w_3 · x + λ x_d ≤ c_3.

Geometrically, as we increase λ ≥ 0, the halfspace h(λ) is formed by a hyperplane rotating around the affine subspace s formed by the intersection of the hyperplanes w_3 · x = c_3 (for λ = 0) and x_d = 0 (for λ = ∞); as such, g ⊆ s, although we do not need this fact anywhere. Since the two original inequalities are valid for P/v, it follows that h(λ) is valid for P/v, for any λ ≥ 0. On the other hand, v ∈ h(0) and v ∉ h(∞). It follows that there is a value λ_0 of λ such that v lies on the hyperplane bounding h(λ_0). Since h(λ_0) is valid for P/v, it follows, by Lemma 10.1.17, that h(λ_0) is valid for P. As such, f = h(λ_0) ∩ P is a face of P that contains both v and g. As such, affine(v, g) ∩ P ⊆ f. It remains to show equality. So consider a point p ∈ f; as before, we can assume that p[d] < 0. As such, r = pv ∩ (x_d = 0) is a point on the boundary of P/v, and furthermore r ∈ g, since h(λ_0) and w_3 · x ≤ c_3 have the same intersection with the hyperplane x_d = 0 (i.e., the boundary of this intersection is s). This implies that p ∈ affine(v, g), and as such f = affine(v, g) ∩ P.
As such, the maps π and σ are well defined. It remains to verify that they are inverses of each other. Indeed, for a face g of P/v we have

π(σ(g)) = π(affine(v, g) ∩ P) = (affine(v, g) ∩ P) ∩ h = affine(g) ∩ (P ∩ h) = affine(g) ∩ (P/v) = g,

since v ∉ h and g ⊆ h. Similarly, for a face f of P that contains v, we have

σ(π(f)) = σ(f ∩ h) = affine(v, f ∩ h) ∩ P = affine(f) ∩ P = f,

since affine(f) can be written as the affine subspace spanned by v together with a set of points in f ∩ h.
Lemma 10.2.1 Let v be a vertex of a H-polyhedron P in IRd , and let f (·) be a linear function, such that f (·) is non-decreasing
along all the edges leaving v (i.e., v is a “local” minimum of f (·) along the edges adjacent to v), then v realizes the global minimum
of f (·) on P .
Proof: Assume, for the sake of contradiction, that this is false, and let x be a point in P such that f(x) < f(v). Let P/v = P ∩ h, where h is a hyperplane. By convexity, the segment xv must intersect P/v; let y be this intersection point. Since P/v is a convex polytope in d − 1 dimensions, y can be written as a convex combination of d of its vertices u_1, . . . , u_d; namely, y = ∑_i α_i u_i, where α_i ≥ 0 and ∑_i α_i = 1. By Lemma 10.1.18, each one of these vertices lies on an edge of P adjacent to v, and by the local optimality of v, we have f(u_i) ≥ f(v). Now, by the linearity of f(·), we have

f(y) = f( ∑_i α_i u_i ) = ∑_i α_i f(u_i) ≥ f(v).

But y is a convex combination of x and v, and f(x) < f(v). As such, f(y) < f(v). A contradiction.
10.3 Garbage
In the following, assume that the polytope (i.e., feasible region) is bounded.
Lemma 10.3.1 Let P be a bounded polytope, and let V be the set of vertices of P . Then P = CH(V).
Consider the intersection of d − 1 hyperplanes of the LP with P. Clearly, this is either empty or a segment connecting two vertices of P. If the intersection is not empty, we will refer to it as an edge connecting the two vertices forming its endpoints. Consider the graph G = G(V, E) formed by this set of edges. The target function assigns each vertex a value. By the general position assumption, we can assume that no two vertices are assigned the same target value. A vertex is a sink if its target value is lower than that of all its neighbors. Assume, for now, that G contains a single sink and that it is connected.
Start a traversal of G at an arbitrary vertex v, and repeatedly move to any of its neighbors that has a lower target value. This walk stops once we arrive at the sink, which is the required optimal solution to the LP. This is essentially the simplex algorithm for Linear Programming; a short sketch of this walk is given below.
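To make the walk concrete, here is a minimal Python sketch of the local-improvement traversal on an abstract vertex graph; the adjacency lists and target values are assumed to be given (extracting them from the LP itself is not shown), so this illustrates only the walk described above, not a full simplex implementation.

    def walk_to_sink(neighbors, target, start):
        """Repeatedly move to any neighbor with a smaller target value,
        until no neighbor improves; the final vertex is a sink.

        neighbors: dict mapping a vertex to the list of its neighbors in G.
        target:    dict mapping a vertex to its value under the target function.
        start:     an arbitrary starting vertex.
        """
        v = start
        while True:
            better = [u for u in neighbors[v] if target[u] < target[v]]
            if not better:
                return v          # v is a sink
            v = better[0]         # any improving neighbor works; pivot rules differ here

    # Toy example: the unit square, minimizing x + y over its four vertices.
    verts = {"00": (0, 0), "10": (1, 0), "01": (0, 1), "11": (1, 1)}
    adj = {"00": ["10", "01"], "10": ["00", "11"], "01": ["00", "11"], "11": ["10", "01"]}
    val = {k: x + y for k, (x, y) in verts.items()}
    assert walk_to_sink(adj, val, "11") == "00"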
To this end, we need to prove that G is connected and has a single sink.
Chapter 11
“Napoleon has not been conquered by man. He was greater than all of us. But god punished him because he relied
on his own intelligence alone, until that prodigious instrument was strained to breaking point. Everything breaks in
the end.”
– Carl XIV Johan, King of Sweden
11.1 Introduction
Let P be a set of n points in IRd. We would like to preprocess it, such that given a query point q, one can quickly determine the closest point in P to q. Unfortunately, the exact problem seems to require prohibitive preprocessing time. (Namely, computing the Voronoi diagram of P, and preprocessing it for point-location queries. This requires (roughly) O(n^⌈d/2⌉) time.)
Instead, we will specify a parameter ε > 0, and build a data-structure that answers (1 + ε)-approximate nearest neighbor queries.
This is yet another instance where solving the bounded spread case is relatively easy.
The set Ai is a set of nodes of depth i in the quadtree that the algorithm visits. Note, all these nodes belong to the canonical
grid G2−i of level −i, where every canonical square has sidelength 2−i . (Thus, nodes of depth i in the quadtree are of level −i. This
is somewhat confusing but it in fact makes the presentation simpler.)
Correctness. Note that the algorithm adds a node w to Ai only if the set Pw might contain points which are closer to q than the
(best) current nearest neighbor the algorithm found, where Pw is the set of points stored in the subtree of w. (More precisely, Pw
might contain a point which is (1 − ε/2) closer to q than any point encountered so far.)
Consider the last node w inspected by the algorithm such that q̂ ∈ P_w, where q̂ denotes the nearest neighbor of q in P. Since the algorithm decided to throw this node away, we have, by the triangle inequality, that

‖q − q̂‖ ≥ ‖q − rep_w‖ − diam(□_w) ≥ (1 − ε/2) r_curr.

Thus, ‖q − q̂‖/(1 − ε/2) ≥ r_curr. However, 1/(1 − ε/2) ≤ 1 + ε, for 1 ≥ ε > 0, as can be easily verified. Thus, r_curr ≤ (1 + ε) d_P(q), and the algorithm returns a (1 + ε)-ANN to q.
Running time analysis. Before barging into a formal proof of the running time of the above search procedure, it is useful to visualize the execution of the algorithm. It visits the quadtree level by level. As long as the grid cells of the level are bigger than the ANN distance r = d_P(q), the number of nodes visited is a constant (i.e., |A_i| = O(1)). This number “explodes” only when the cell size becomes smaller than r, but then the search stops once we reach grid size O(εr). In particular, since the number of grid cells visited (in the second stage) grows exponentially with the level, we can use the number of nodes visited at the bottom level (i.e., O(1/ε^d)) to bound the query running time for this part of the query. (The accompanying figure illustrates this: the explosion starts at cell size ≈ d_P(q) and ends at cell size ≈ ε d_P(q), where O(1/ε^d) cells are visited.)
Lemma 11.2.1 Let P be a set of n points contained inside the unit hypercube in IRd, and let T be a quadtree of P, where diam(P) = Ω(1). Let q be a query point, and let ε > 0 be a parameter. A (1 + ε)-ANN to q can be computed in O(ε^{−d} + log(1/$)) time, where $ = ‖q − q̂‖.
Proof: The algorithm is described above. We are only left with the task of bounding the query time. Observe that if a node w ∈ T is considered by the algorithm and diam(□_w) < (ε/4)$, then

‖q − rep_w‖ − diam(□_w) ≥ ‖q − rep_w‖ − (ε/4)$ ≥ r_curr − (ε/4) r_curr ≥ (1 − ε/4) r_curr,

which implies that neither w nor any of its children would be inserted into the sets A_1, . . . , A_m, where m is the depth of T, by Eq. (11.1). Thus, no node of depth ≥ h = ⌈− lg($ε/4)⌉ is considered by the algorithm.
Consider the node u of T of depth i containing q̂. Clearly, the distance between q and rep_u is at most ℓ_i = $ + diam(□_u) = $ + √d · 2^{−i}. As such, at the end of the ith iteration we have r_curr ≤ ℓ_i, since the algorithm has inspected u. Thus, the only cells of G_{2^{−i−1}} that might be considered by the algorithm are the ones in distance ≤ ℓ_i from q. The number of such cells is

n_i = O( ⌈ ℓ_i / 2^{−i−1} ⌉ )^d = O( 1 + ($ + √d · 2^{−i}) / 2^{−i−1} )^d = O( 1 + (2^i $)^d ),

since for any a, b ≥ 0 we have (a + b)^d ≤ (2 max(a, b))^d ≤ 2^d (a^d + b^d). Thus, the total number of nodes visited is

∑_{i=0}^{h} n_i = ∑_{i=0}^{⌈− lg($ε/4)⌉} O( 1 + (2^i $)^d ) = O( lg(1/($ε)) + ( $ / ($ε/4) )^d ) = O( log(1/$) + 1/ε^d ).
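To make the search procedure concrete, the following is a minimal Python sketch of the level-by-level traversal with the pruning rule used above; the quadtree node interface (rep, diam, children) is an assumption made for this sketch, and building the quadtree itself is not shown.

    import math

    def ann_query(root, q, eps):
        """(1+eps)-ANN search, processing quadtree nodes level by level.
        Each node w is assumed to provide: w.rep (a point of P stored in its
        subtree), w.diam (the diameter of its cell), and w.children (a list)."""
        best, r_curr = root.rep, math.dist(q, root.rep)
        level = [root]                              # the set A_0
        while level:
            next_level = []                         # the set A_{i+1}
            for w in level:
                d = math.dist(q, w.rep)
                if d < r_curr:
                    best, r_curr = w.rep, d
                # Keep w's children only if its subtree might still contain a
                # point noticeably closer than the current candidate.
                if d - w.diam < (1 - eps / 2.0) * r_curr:
                    next_level.extend(w.children)
            level = next_level
        return best, r_curr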
A less trivial task is to adapt the algorithm so that it uses compressed quadtrees. To this end, the algorithm would still handle the nodes by levels. This requires us to keep a heap of integers in the range 0, −1, . . . , −⌈lg Φ(P)⌉. This can be easily done by maintaining an array of size O(log Φ(P)), where each array cell maintains a linked list of all the nodes of that level. Clearly, an insertion/deletion into this heap data-structure can be handled in constant time by augmenting it with a hash table. Thus, the above algorithm works in this case after modifying it to use this “level” heap instead of the sets A_i.
Theorem 11.2.3 Let P be a set of n points in IRd. One can preprocess P in O(n log n) time, using linear space, such that given a query point q and a parameter 1 ≥ ε > 0, one can return a (1 + ε)-ANN to q in O(1/ε^d + log Φ(P)) time. In fact, the query time is O(1/ε^d + log(diam(P)/$)), where $ = d_P(q).
Plan of attack. To answer an ANN query in the general case, we will first get a fast rough approximation. Next, using a compressed quadtree, we will find a constant number of relevant nodes, and apply Theorem 11.2.3 to those nodes. This yields the required approximation. Before solving this problem, we need a minor extension of the compressed quadtree data-structure.
Lemma 11.3.1 One can perform a cell query in a compressed quadtree T̂ in O(log n) time, where n is the size of T̂. Namely, given a query canonical cell □̂, one can find, in O(log n) time, the node w ∈ T̂ such that □_w ⊆ □̂ and P ∩ □̂ = P_w.
Lemma 11.3.2 Let P be a set of n points in IRd . One can build a data structure TR , in O(n log n) time, such that given a query point
q ∈ IRd , one can return a (1 + 4n)-ANN of q in P in O(log n) time.
Given a query point q, using T_R, we compute a point u ∈ P such that $ ≤ ‖u − q‖ ≤ (1 + 4n)$, where $ = d_P(q). Let R = ‖u − q‖ and r = ‖u − q‖/(4n + 1). Clearly, r ≤ $ ≤ R. Next, compute ℓ = ⌈lg R⌉, and let C be the set of cells of G_{2^ℓ} that are in distance ≤ R from q. Clearly, since R ≤ 2^ℓ, it follows that q̂ ∈ ∪_{□∈C} □, where q̂ is the nearest neighbor to q in P. For each cell □ ∈ C, we compute the node v ∈ T̂ such that P ∩ □ = P_v, using a cell query (i.e., Lemma 11.3.1). Let V be the resulting set of nodes of T̂.
For each node v ∈ V, we now apply the algorithm of Theorem 11.2.3 to the compressed quadtree rooted at v. Since |V| = O(1) and diam(P_v) = O(R) for all v ∈ V, the query time is

O( ∑_{v∈V} ( 1/ε^d + log(diam(P_v)/r) ) ) = O( 1/ε^d + ∑_{v∈V} log(diam(P_v)/r) ) = O( 1/ε^d + ∑_{v∈V} log(R/r) ) = O( 1/ε^d + log n ).
As for the correctness of the algorithm, notice that there is a node w ∈ V such that q̂ ∈ P_w. As such, when we apply the algorithm of Theorem 11.2.3 to w, it returns a (1 + ε)-ANN to q.
Theorem 11.3.3 Let P be a set of n points in IRd . One can construct a data-structure of linear size, in O(n log n) time, such that
given a query point q ∈ IRd , and a parameter 1 ≥ ε > 0, one can compute a (1 + ε)-ANN to q in O(1/εd + log n) time.
11.4.1 Low Quality ANN Search - Point Location with Random Shifting
11.4.1.1 The data-structure and search procedure
Let P be a set of n points in IRd, contained inside the square [0, 1/2]^d. Let v⃗ be a random vector in the square [0, 1/2]^d, and consider the point set Q = P + v⃗ = { p + v⃗ | p ∈ P }. Clearly, given a query point q, we can answer an ANN query on P by answering the ANN query q + v⃗ on Q. Note that Q is contained inside the unit cube; consider the compressed quadtree T̂ built for Q.
Given a query point q, let v be the node of T̂ whose region rg_v contains q′ = q + v⃗. If rg_v is a cube (i.e., v is a leaf) and v stores a point p ∈ Q inside it, then we return ‖q′ − p‖ as the distance to the ANN. If v does not store a point of Q, then we return 2·diam(rg_v) as the distance to the ANN.
Things get more exciting if v is a compressed node. In this case, there is no point of Q associated with v (by construction). Let w be its only child, and return d(rg_w, q′) + diam(rg_v) as the distance to the ANN.
Lemma 11.4.1 Let I = [α, β] be an interval, let r = 2^{−ℓ} for some integer ℓ ≥ 1, and let U = { kr | k is an integer }. For a random number x picked uniformly from [0, 1/2], the probability that the shifted interval I + x contains a point of U is at most ‖I‖/r, where ‖I‖ = β − α.
Proof: If ‖I‖ ≥ r, then any translation of I contains a point of U, and the claim trivially holds.
Otherwise, let X be the set of real numbers such that if x ∈ X, then I + x contains a point of U. The set X is a periodic set of intervals; formally, X = ∪_k ([−β, −α] + kr). As such, the total length of X inside the interval [0, 1/2] is ρ = (‖I‖/r) · ‖[0, 1/2]‖, since r is a power of 2 and thus 1/2 is a multiple of r. As such, the probability that a random point inside [0, 1/2] falls inside X is ρ/‖[0, 1/2]‖ = ‖I‖/r.
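As a quick sanity check of this bound (not part of the construction), the following Python snippet estimates the probability empirically for a concrete interval and grid size; the specific numbers used are arbitrary examples.

    import math
    import random

    def hit_probability(alpha, beta, r, trials=100000):
        """Estimate Pr[ the shifted interval [alpha, beta] + x contains a
        multiple of r ], for x uniform in [0, 1/2]."""
        hits = 0
        for _ in range(trials):
            x = random.uniform(0.0, 0.5)
            lo, hi = alpha + x, beta + x
            # [lo, hi] contains a multiple of r iff the largest multiple of r
            # that is <= hi is still >= lo.
            if math.floor(hi / r) * r >= lo:
                hits += 1
        return hits / trials

    # For an interval of length 0.03 and r = 1/8, the estimate should be
    # roughly 0.03 / 0.125 = 0.24, matching the bound of the lemma.
    print(hit_probability(0.1, 0.13, 0.125))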
Lemma 11.4.2 Let s be a segment of length ∆. Consider the randomly shifted segment s + v⃗. The probability that s + v⃗ intersects the boundary of the canonical grid G_r is at most d∆/r, where r = 2^{−ℓ} and ℓ ≥ 1.
Proof: Let I_i = [α_i, β_i] be the projection of s onto the ith axis, for i = 1, . . . , d. Clearly, s + v⃗ intersects one of the separating hyperplanes orthogonal to the ith dimension if and only if the (randomly shifted) interval I_i contains a point of U = { kr | k is an integer }. By Lemma 11.4.1, the probability for that is at most δ_i/r, where δ_i = ‖I_i‖. Thus, the probability that s + v⃗ intersects the boundary of G_r is bounded by

∑_{i=1}^{d} δ_i/r ≤ d∆/r,

as claimed.
Lemma 11.4.3 For any integer i, and a query point q, the above data-structure returns a 4n^i approximation to the distance to the nearest neighbor of q in P, with probability ≥ 1 − 2/n^{i−1}.
Proof: For the time being, it is easier to consider T̂ to be a regular (i.e., not compressed) quadtree of the point set Q. Let v be the leaf of T̂ that contains the query point q′ = q + v⃗. Let p = q̂′ be the nearest neighbor to q′ in Q, and consider the segment s = q̂ q. Clearly, s′ = s + v⃗ = q′p.
Now, if s′ is completely contained in the leaf v of T̂ that contains q′, then p is stored in v, and we return ‖s′‖ as the distance to the ANN.
If s′ intersects the boundary of the leaf v, let r be the side length of rg_v. Observe that r ≥ ‖s‖/√d. Indeed, if r ≤ ‖s‖/√d, then rg_v ⊆ b(q′, ‖s‖), but the interior of b(q′, ‖s‖) is empty of any point of Q; namely, the region of v would not have been further refined by the construction algorithm. As such, the result returned by the algorithm is never too small; it can only be too large.
In particular, if the side length of the leaf that contains q′ is r, then s′ intersects the boundary of the grid G_r, and the probability for that to happen is at most d$/r, by Lemma 11.4.2. Thus, let x be the smallest power of two which is larger than n^i $. We have that

Pr[ dist. returned ≥ 2√d · n^i $ ] ≤ Pr[ r ≥ n^i $ ] ≤ ∑_{j=0}^{∞} d$/(2^j x) ≤ ∑_{j=0}^{∞} d$/(2^j n^i $) = 2d/n^i ≤ 2/n^{i−1}.
Namely, if T is a t-ring tree, then for any node v ∈ T, the interior of the ring b(pv , (1 + t)rv ) \ b(pv , rv ) is empty of any point of
P. Intuitively, the bigger t is, the better T clusters P.
The ANN search procedure. Let q denote the query point. Initially, set v to be the root of T, and r_curr ← ∞. The algorithm answers the ANN query by traversing down T.
During the traversal, we first compute the distance l = ‖q − rep_v‖. If this is smaller than r_curr (the distance to the current nearest neighbor found), then we update r_curr (and store the point realizing the new value of r_curr).
If ‖q − p_v‖ ≤ r̂, where r̂ = (1 + t/2)r_v is the “middle” radius of the ring, we continue the search recursively in the child containing P_v^in. Otherwise, we continue the search in the subtree containing P_v^out. The algorithm stops when reaching a leaf of T, and returns the point realizing r_curr as the approximate nearest neighbor.
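A minimal Python sketch of this traversal follows; the node fields (rep, p, r, t, in_child, out_child) are assumptions matching the notation above, with leaves marked by having no children.

    import math

    def ring_tree_query(root, q):
        """Approximate nearest neighbor query on a t-ring tree.
        Each internal node v is assumed to store: v.rep (its representative
        point), v.p and v.r (center and inner radius of its ring), v.t (the
        ring parameter), and the children v.in_child and v.out_child."""
        best, r_curr = None, float("inf")
        v = root
        while v is not None:
            d = math.dist(q, v.rep)
            if d < r_curr:
                best, r_curr = v.rep, d
            if getattr(v, "in_child", None) is None:     # reached a leaf of T
                break
            middle = (1.0 + v.t / 2.0) * v.r              # the "middle" radius
            v = v.in_child if math.dist(q, v.p) <= middle else v.out_child
        return best, r_curr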
Intuition. If the query point q is outside the outer ball of a node v, then it is so far from the points inside the inner ball (i.e., P_v^in) that we can treat all of them as a single point (i.e., rep_v). On the other hand, if the query point q′ is inside the inner ball, then it must have a neighbor nearby (i.e., a point of P_v^in), and all the points of P_v^out are far enough away that they can be ignored. Naturally, if the query point falls inside the ring, the same argumentation works (with slightly worse constants), using the middle radius r̂ as the splitting boundary in the search. See the figure on the right.
Lemma 11.4.5 Given a t-ring tree T, one can answer (1 + 4/t)-approximate nearest neighbor queries in O(depth(T)) time.
Proof: Clearly, the query time is O(depth(T)). As for the quality of approximation, let π denote the generated search path in T, and let q̂ denote the nearest neighbor to q in P. Furthermore, let w denote the last node in the search path π such that q̂ ∈ P_w. Clearly, if q̂ ∈ P_w^in but we continued the search in P_w^out, then q is outside the middle sphere, and ‖q − q̂‖ ≥ (t/2) r_w (since this is the distance between the middle sphere and the inner sphere). Thus,

‖q − rep_w‖ ≤ ‖q − q̂‖ + ‖q̂ − rep_w‖ ≤ ‖q − q̂‖ + 2 r_w,

since q̂, rep_w ∈ b_w = b(p_w, r_w). In particular,

‖q − rep_w‖ / ‖q − q̂‖ ≤ ( ‖q − q̂‖ + 2 r_w ) / ‖q − q̂‖ ≤ 1 + 4/t.
In low dimensions, there is always a good separating ring. Indeed, consider the smallest ball b = b(p, r) that contains n/c_1 points of P, where c_1 is a sufficiently large constant. Let b′ be the scaling of this ball by a factor of two. By a standard packing argument, the ring b′ \ b can be covered with c = O(1) copies of b, none of which can contain more than n/c_1 points of P. It follows that by picking c_1 = 3c, we are guaranteed that at least half the points of P are outside b′. Now, the ring can be split into n/2 empty rings (by taking a sphere that passes through each point inside the ring), and one of them would be of thickness at least r/n, and it would separate n/c_1 points of P from the outer n/2 points of P. Doing this efficiently requires trading off some constants, and some tedious details, as described in the following lemma.
Lemma 11.4.6 Given a set P of n points in IRd, one can compute a (1/n)-ring tree of P in O(n log n) time.
Proof: The construction is recursive. Compute a ball D = b(p, r) that contains ≥ n/c points of P, such that r ≤ 2 r_opt(P, n/c), where c is a constant to be determined shortly. We remind the reader that r_opt(P, n/c) is the radius of the smallest ball that contains n/c points of P, and that the ball D can be computed in linear time, by Lemma 1.3.1. Consider the ball D′ of radius 2r centered at p.
The ball D′ can be covered by a hypercube S with side length 4r. Furthermore, partition S into a grid such that every cell has side length r/L, for L = ⌈16√d⌉. Every cell in this grid has diameter ≤ 4r√d/L ≤ r/4 ≤ r_opt(P, n/c)/2. Thus, every grid cell can contain at most n/c points, since it can be covered with a ball of radius r_opt(P, n/c)/4. There are M = (4r/(r/L))^d = (4L)^d grid cells. Thus, D′ contains at most (4L)^d (n/c) points. Specifically, for c = 2(4L)^d, the ball D′ contains at most n/2 points of P.
In particular, there must be a radius r′, with r ≤ r′ ≤ 2r, and an h ≥ r/n, such that the ring b(p, r′ + h) \ b(p, r′) does not contain any point of P in its interior.
Indeed, sort the points of P inside D′ \ D by their distances from p. There are at most n/2 such numbers in the range of distances [r, 2r]. As such, there must be an empty interval of length ≥ r/(n/2 + 1), and this empty range corresponds to the empty ring.
Computing r′ and h is done by computing the distance of each point from p, and partitioning the distance range [r, 2r] into 2n equal length segments. In each segment, we register the point with minimum and the point with maximum distance from p in this range. This can be done in linear time using the floor function. Next, scan these buckets from left to right. Observe that the maximum length gap is realized by the maximum of one bucket, together with a consecutive sequence of empty buckets, ending at the minimum of a non-empty bucket. As such, the maximum length interval can be computed in linear time, yielding r′ and h.
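The bucketing step just described can be sketched as follows in Python; the sketch assumes that the distances from p of the points of D′ \ D have already been computed and all lie in [r, 2r], and it returns the left endpoint and the length of the largest empty gap.

    def largest_empty_gap(dists, r, n):
        """Scan 2n equal-length buckets over [r, 2r] and return (r_prime, h),
        the start and length of the largest sub-interval of [r, 2r] that
        contains no value of dists in its interior."""
        m = 2 * n
        width = r / m
        lo = [None] * m                      # minimum distance in each bucket
        hi = [None] * m                      # maximum distance in each bucket
        for d in dists:
            j = min(int((d - r) / width), m - 1)
            lo[j] = d if lo[j] is None else min(lo[j], d)
            hi[j] = d if hi[j] is None else max(hi[j], d)
        best_start, best_len, prev_max = r, 0.0, r
        for j in range(m):
            if lo[j] is None:                # empty bucket, the gap keeps growing
                continue
            if lo[j] - prev_max > best_len:
                best_start, best_len = prev_max, lo[j] - prev_max
            prev_max = hi[j]
        if 2 * r - prev_max > best_len:      # gap ending at the outer radius 2r
            best_start, best_len = prev_max, 2 * r - prev_max
        return best_start, best_len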
Thus, let v be the root of the new tree, set P_v^in to be P ∩ b(p, r′) and P_v^out = P \ P_v^in, and store b_v = b(p, r′) and p_v = p. Continue the construction recursively on those two sets. Observe that |P_v^in|, |P_v^out| ≥ n/c, where c is a constant. It follows that the construction time of the algorithm is T(n) = O(n) + T(|P_v^in|) + T(|P_v^out|) = O(n log n), and the depth of the resulting tree is O(log n).
time of the algorithm is T (n) = O(n) + T Pvin + T Pvout = O(n log n), and the depth of the resulting tree is O(log n).
Combining the above two lemmas, we get the following result.
Theorem 11.4.7 Let P be a set of n points in IRd . One can preprocess it in O(n log n) time, such that given a query point q ∈ IRd ,
one can return a (1 + 4n)-ANN of q in P in O(log n) time.
Linear space. In low dimensions, the seminal work of Arya et al. [AMN+ 98], mentioned above, was the first to offer a linear size data-structure with logarithmic query time, such that the approximation quality is specified with the query. The query time of Arya et al. is slightly worse than the running time of Theorem 11.3.3, since they maintain a heap of cells, always handling the cell closest to the query point. This results in query time O(ε^{−d} log n). It can be further improved to O(ε^{−d} log(1/ε) + log n) by observing that this heap has only very few delete-mins, and many insertions. This observation is due to Duncan [Dun99].
Instead of having a separate ring-tree, Arya et al. rebalance the compressed quadtree directly. This results in nodes, which
correspond to cells that have the shape of an annulus (i.e., the region formed by the difference between two canonical grid cells).
Duncan [Dun99] and some other authors offered data-structures (called BAR-trees) with similar query time, but they seem to be inferior, in practice, to the work of Arya et al., for the reason that while the regions the nodes correspond to are convex, they have higher descriptive complexity, and it is harder to compute the distance of the query point to a cell.
Faster query time. One can improve the query time if one is willing to specify ε during the construction of the data-structure, resulting in a trade-off between space and query time. In particular, Clarkson [Cla94] showed that one can construct a data-structure of (roughly) size O(n/ε^{(d−1)/2}), with query time O(ε^{−(d−1)/2} log n). Chan simplified and cleaned up this result [Cha98], and also presented some other results.
Details on Faster Query Time. A set of points Q is √ε-far from a query point q if ‖q − c_Q‖ ≥ diam(Q)/√ε, where c_Q is some point of Q. It is easy to verify that if we partition space around c_Q into cones with central angle O(√ε) (this requires O(1/ε^{(d−1)/2}) cones), then the most extreme point of Q in such a cone ψ, furthest away from c_Q, is a (1 + ε)-approximate nearest neighbor for any query point inside ψ which is √ε-far. Namely, we precompute the ANN inside each cone, if the query point is far enough. Furthermore, by careful implementation (i.e., a grid in the space of angles), we can decide, in constant time, which cone the query point lies in. Thus, using O(1/ε^{(d−1)/2}) space, we can answer (1 + ε)-ANN queries for q, if the query point is √ε-far, in constant time.
Next, construct this data-structure for every set P_v, for v ∈ T̂(P). This results in a data-structure of size O(n/ε^{(d−1)/2}). Given a query point q, we use the algorithm of Theorem 11.3.3, and stop as soon as, for a node v, P_v is √ε-far, and then we use the secondary data-structure for P_v. It is easy to verify that the algorithm stops as soon as diam(□_v) = O(√ε · d_P(q)). As such, the number of nodes visited is O(log n + 1/ε^{d/2}), and the query time is identical.
Note that we omitted the construction time (which requires some additional work to be done efficiently), and our query time is slightly worse than the best known. The interested reader can check out the work by Chan [Cha98], which is somewhat more complicated than what is outlined here.
The first to achieve O(log(n/ε)) query time (using near linear space) was Har-Peled [Har01b], using space roughly O(nε^{−d} log² n). This was later simplified and improved by Arya and Malamatos [AM02], who presented a data-structure with the same query time and of size O(n/ε^d). These data-structures rely on the notion of computing approximate Voronoi diagrams and performing point location queries in those diagrams. By extending the notion of approximate Voronoi diagrams, Arya, Mount and Malamatos [AMM02] showed that one can answer (1 + ε)-ANN queries in O(log(n/ε)) time, using O(n/ε^{d−1}) space. On the other end of the spectrum, they showed that one can construct a data-structure of size O(n) with query time O(log n + 1/ε^{(d−1)/2}) (note that for this data-structure ε > 0 has to be specified in advance). In particular, the latter result breaks a space/query time tradeoff that all the other results suffer from (i.e., the query time multiplied by the construction time has dependency of 1/ε^d on ε).
Practical considerations. Arya et al. [AMN+ 98] implemented their algorithm. For most inputs, it is essentially a kd-tree. The code of their library was carefully optimized and is very efficient. In particular, in practice, I would expect it to beat most of the algorithms mentioned above. The code of their implementation is available online as a library [AM98].
Higher dimensions. All our results have exponential dependency on the dimension, in query and preprocessing time (although the space can probably be made subexponential with careful implementation). Getting subexponential algorithms requires completely different techniques, and is discussed in detail at some other point.
Stronger computation models. If one assumes that the points have integer coordinates in the range [1, U], then approximate nearest-neighbor queries can be answered in (roughly) O(log log U + 1/ε^d) time [AEIS99], or even O(log log(U/ε)) time [Har01b]. The algorithm of Har-Peled [Har01b] relies on computing a compressed quadtree of height O(log(U/ε)), and performing fast point-location queries in it. This only requires using the floor function and hashing (note that the algorithm of Theorem 11.3.3 uses the floor function and hashing during the construction, but they are not used during the query). In fact, if one is allowed to slightly blow up the space (by a factor U^δ, where δ > 0 is an arbitrary constant), the ANN query time can be improved to constant [HM04].
By shifting quadtrees, and creating d + 1 quadtrees, one can argue that the approximate nearest neighbor must lie in the same cell (of the “right” size) as the query point in one of those quadtrees. Next, one can map the points to real numbers, by using the natural space filling curve associated with each quadtree. This results in d + 1 lists of points. One can argue that a constant factor approximate nearest neighbor must be adjacent to the query point in one of those lists. This can later be improved into a (1 + ε)-ANN by inspecting O(1/ε^d) points. This simple algorithm is due to Chan [Cha02].
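As an illustration of the space-filling-curve idea (and not of the algorithm used in this chapter), here is a short Python sketch of comparing two points with non-negative integer coordinates by their Z-order (Morton order) without explicitly interleaving bits; sorting each shifted copy of the point set with this comparator and checking a few list neighbors of the query is the essence of the scheme sketched above.

    from functools import cmp_to_key

    def less_msb(x, y):
        # True if the most significant set bit of x is strictly below that of y.
        return x < y and x < (x ^ y)

    def cmp_zorder(a, b):
        """Compare points a, b (tuples of non-negative ints) by Morton/Z-order."""
        dim_of_max, x = 0, 0
        for k in range(len(a)):
            y = a[k] ^ b[k]
            if less_msb(x, y):
                dim_of_max, x = k, y
        return a[dim_of_max] - b[dim_of_max]

    # Example: sort a small 2D point set along the Z-order curve.
    pts = [(3, 5), (1, 1), (4, 0), (2, 7)]
    pts.sort(key=cmp_to_key(cmp_zorder))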
The reader might wonder why we bothered with a considerably more involved algorithm. There are several reasons: (i) This
algorithm requires the numbers to be integers of limited length (i.e., O(log U) bits), and (ii) it requires shuffling of bits on those
integers (i.e., for computing the inverse of the space filling curve) in constant time, and (iii) the assumption is that one can combine
d such integers into a single integer and perform XOR on their bits in constant time. The last two assumptions are not reasonable
when the input is made out of floating point numbers.
Further research. At least (and only) in low dimensions, the ANN problem seems to be essentially solved both in theory and practice (such proclamations are inherently dangerous, and should be taken with a considerable amount of healthy skepticism). Indeed, for ε > 1/log^{1/d} n, the current data-structure of Theorem 11.3.3 provides logarithmic query time. Thus, ε has to be quite small for the query time to become bad enough that one would wish to speed it up.
Main directions for further research seem to be working on this problem in higher dimensions, and solving it in other computation models.
Surveys. A survey on approximate nearest neighbor search in high dimensions is by Indyk [Ind04]. In low dimensions, there is
a survey by Arya and Mount [AM04].
11.6 Exercises
Exercise 11.6.1 (Better Ring Tree) [10 Points]
Let P be a set of n points in IRd. Show how to build a ring tree, of linear size, that can answer O(log n)-ANN queries in O(log n) time. [Hint: Show that there is always a ring containing O(n/log n) points, such that it is of width w and its inner radius is O(w log n). Next, build a ring tree, replicating the points in both children of such a ring node. Argue that the size of the resulting tree is linear, and prove the claimed bounds on the query time and the quality of approximation.]
Chapter 12
Today I know that everything watches, that nothing goes unseen, and that even wallpaper has a better memory than ours. It isn't God in His heaven that sees all. A kitchen chair, a coat-hanger, a half-filled ashtray, or the wood replica of a woman named Niobe, can perfectly well serve as an unforgetting witness to every one of our acts.
– The Tin Drum, Günter Grass
Definition 12.1.1 A metric space M is a pair (X, d) where X is a set and d : X × X → [0, ∞) is a metric, satisfying the following
axioms: (i) d(x, y) = 0 iff x = y, (ii) d(x, y) = d(y, x), and (iii) d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality).
For example, IR2 with the regular Euclidean distance is a metric space.
In the following, we are going to assume that we are provided with a black box, such that given two points x, y ∈ X, we can compute the distance d(x, y) in constant time.
Definition 12.1.2 Let P be a set of elements, and T a tree having the elements for P as leaves. The tree T defines a hierarchically
well-separated tree (HST) over the points of P, if to each vertex u ∈ T there is associated a label ∆u ≥ 0, such that ∆u = 0 if and
only if u is a leaf of T . The labels are such that if a vertex u is a child of a vertex v then ∆u ≤ ∆v . The distance between two leaves
x, y ∈ T is defined as ∆lca(x,y) , where lca(x, y) is the least common ancestor of x and y in T .
If every internal node of T has exactly two children, we will refer to it as being a binary HST (BHST).
For convenience, we will assume that the underlying tree is binary (any HST can be converted to binary HST in linear time,
while retaining the underlying distances).
We will also associate with every vertex u ∈ T an arbitrary leaf rep_u of the subtree rooted at u. We also require that rep_u ∈ { rep_v | v is a child of u }.
A metric N is said to t-approximate the metric M, if they are on the same set of points, and dM (u, v) ≤ dN (u, v) ≤ t · dM (u, v),
for any u, v ∈ M.
It is not hard to see that any n-point metric is (n − 1)-approximated by some HST.
Lemma 12.1.3 Given a weighted connected graph G on n vertices and m edges, it is possible to construct, in O(n log n + m) time,
a binary HST H that (n − 1)-approximates the shortest path metric on G.
Proof: Compute the minimum spanning tree of G in O(n log n + m) time, and let T denote this tree.
Sort the edges of T in non-decreasing order, and add them to the graph one by one. The HST is built bottom up. At each
point we have a collection of HSTs, each corresponding to a connected component of the current graph. When an added edge merges two connected components, we merge the two corresponding HSTs into one by adding a new common root for the two HSTs, and labeling this root with the edge's weight times n − 1. This algorithm is only a slight variation on Kruskal's algorithm, and has the same running time.
We next estimate the approximation factor. Let x, y be two vertices of G. Denote by e the first edge that was added in the process above that put x and y in the same connected component C. Note that at that point in time e is the heaviest edge in C, so w(e) ≤ dG(x, y) ≤ (|C| − 1) w(e) ≤ (n − 1) w(e). Since dH(x, y) = (n − 1) w(e), we are done.
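A minimal Python sketch of this bottom-up merging follows; the minimum spanning tree is assumed to be given as a list of (weight, u, v) edges over vertices 0, …, n − 1, a simple union-find structure tracks the components, and the HST is returned as nested tuples (label, left, right) with integer leaves.

    def build_hst(n, mst_edges):
        """Build a binary HST from the MST edges of an n-vertex graph.
        mst_edges: list of (w, u, v) with u, v in range(n).  Returns the root;
        a leaf is an int, an internal node is a tuple (label, left, right)."""
        parent = list(range(n))              # union-find over the original vertices
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        comp_root = {i: i for i in range(n)}   # current HST root of each component
        for w, u, v in sorted(mst_edges):      # process edges in non-decreasing order
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            # Merge the two HSTs under a new root labeled (n - 1) * w.
            merged = ((n - 1) * w, comp_root[ru], comp_root[rv])
            parent[rv] = ru
            comp_root[ru] = merged
        return comp_root[find(0)]

    # Example: a path 0-1-2 with edge weights 1 and 3.
    root = build_hst(3, [(1, 0, 1), (3, 1, 2)])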
Corollary 12.1.4 For a set P of n points in IRd , one can construct, in O(n log n) time, a BHST H that (2n − 2)-approximates the
distances of points in P. That is, for any p, q ∈ P, we have dH (p, q)/(2n − 2) ≤ kpqk ≤ dH (p, q).
Proof: We remind the reader, that in IRd , one can compute a 2-spanner for P of size O(n), in O(n log n) time (see Theorem 3.2.1).
Let G be this spanner, and apply Lemma 12.1.3 to this spanner. Let H be the resulting metric. For any p, q ∈ P, we have
kpqk ≤ dH (p, q) ≤ (n − 1)dG (p, q) ≤ 2(n − 1) kpqk.
Corollary 12.1.4 is unique to IRd since for general metric spaces, no HST can be computed in subquadratic time, see Exer-
cise 12.5.1.
Corollary 12.1.5 For a set P of n points in a metric space M, one can compute a HST H that (n − 1)-approximates the metric dM .
Our objective is to show that (1 + ε)-ANN queries can be reduced to target ball queries among a near linear size set of balls. But let us first start with a “silly” result, to get some intuition about the problem.
Lemma 12.2.2 Let B = ∪_{i=−∞}^{∞} B(P, (1 + ε)^i), where B(P, r) = ∪_{p∈P} b(p, r). For q ∈ IRd, let b = B(q) be the target ball of q in B (i.e., the smallest ball of B containing q), and let p ∈ P be the center of b. Then, p is a (1 + ε)-ANN to q.
Proof: Let q̂ be the nearest neighbor to q in P, and let r = d_P(q). Let i be such that (1 + ε)^i < r ≤ (1 + ε)^{i+1}. Clearly, radius(b) > (1 + ε)^i, since no ball of B of radius ≤ (1 + ε)^i contains q. On the other hand, q ∈ b(q̂, (1 + ε)^{i+1}) ∈ B, and thus radius(b) ≤ (1 + ε)^{i+1}. It follows that ‖qp‖ ≤ radius(b) ≤ (1 + ε)^{i+1} ≤ (1 + ε) d_P(q), implying that p is a (1 + ε)-ANN to q.
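To make the lemma concrete, here is a brute-force Python sketch of such a target query; since the infinite family of radii cannot be enumerated, the sketch truncates it to exponents in a range [i_min, i_max] supplied by the caller, which is an assumption of this illustration only.

    import math

    def silly_target_ann(P, q, eps, i_min, i_max):
        """Return the center of the smallest ball of B = { b(p, (1+eps)^i) }
        that contains q, scanning the radii in increasing order (Lemma 12.2.2)."""
        for i in range(i_min, i_max + 1):
            radius = (1 + eps) ** i
            for p in P:
                if math.dist(q, p) <= radius:
                    return p      # first hit = smallest radius containing q
        return None               # q is not covered by the truncated range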
Remark 12.2.3 Here is an intuitive construction of a set of balls of polynomial size, such that target queries answer (1 + ε)-ANN correctly. Indeed, consider two points u, v ∈ P. As far as correctness is concerned, we care whether the ANN returns either u or v, for a query point q, only if d_P(q) ∈ [d_M(u, v)/4, 2 d_M(u, v)/ε] (for shorter distances, either u or v is the unique ANN, and for longer distances either one of them is an ANN for the set {u, v}).
Next, consider a range of distances [(1 + ε)^i, (1 + ε)^{i+1}] to be active if there are u, v ∈ P such that ε(1 + ε)^i ≤ d_M(u, v) ≤ 4(1 + ε)^{i+1}/ε. Clearly, the number of active intervals is O(n² ε^{−1} log(1/ε)) (one can prove a better bound). Generate a ball for each point of P, for each active range. Clearly, the resulting number of balls is polynomial, and can be used to resolve ANN queries.
Getting the number of balls to be near linear requires us to be more careful, and the details are provided below.
Definition 12.2.5 One can, in fact, resolve ANN queries on a range of distances [a, b] by building NNbr data structures with exponential jumps on this range. Formally, let N_i = NNbr(P, r_i), where r_i = (1 + ε)^i a, for i = 0, . . . , M − 1, where M = ⌈log_{1+ε}(b/a)⌉, and let N_M = NNbr(P, r_M), where r_M = b. We will denote this set of data-structures by Î(P, a, b, ε) = {N_0, . . . , N_M}. We refer to Î as an interval near-neighbor data structure.
Lemma 12.2.6 Given P as above, and parameters a ≤ b and ε > 0, we have: (i) Î(P, a, b, ε) is made out of O(ε^{−1} log(b/a)) NNbr data structures, and (ii) Î(P, a, b, ε) contains O(ε^{−1} n log(b/a)) balls overall.
Furthermore, one can decide which of the following options holds: (i) d_P(q) < a, (ii) d_P(q) > b, or (iii) return a number r and a point p ∈ P, such that r < d_P(q) ≤ ‖pq‖ ≤ (1 + ε)r (i.e., p is a (1 + ε)-ANN to q). This requires two NNbr queries if (i) or (ii) holds, and O(log(ε^{−1} log(b/a))) queries otherwise.
Proof: Given a query point q, we first check if dP (q) ≤ a, by querying N0 , and if so, the algorithm returns “dP (q) ≤ a”.
Otherwise, we check if dP (q) > b, by querying N M , and if so, the algorithm returns “dP (q) > b”.
Otherwise, let Xi = 1 if and only if dP (q) ≤ ri , for i = 0, . . . , M. We can determine the value of Xi by performing a query in
the data-structure Ni . Clearly, X0 , X1 , . . . , X M is a sequence of zeros, followed by a sequence of ones. As such, we can find the i,
such that Xi = 0 and Xi+1 = 1, by performing a binary search. This would require O(log M) queries. In this case, we have that
ri < dP (q) ≤ ri+1 ≤ (1 + ε)ri . Namely, the ball in NNbr(P, ri+1 ) covering q, corresponds to (1 + ε)-ANN to q.
To get the stated bounds, observe that by the Taylor expansion, ln(1 + x) = x − x²/2 + x³/3 − · · · ≥ x/2, for 0 < x ≤ 1. Thus,

M = ⌈ log_{1+ε}(b/a) ⌉ = O( ln(b/a) / ln(1 + ε) ) = O( log(b/a)/ε ),

since 1 ≥ ε > 0.
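The binary search in the proof can be sketched as follows in Python; here nnbr_query(i, q) is an assumed black box that, for the data-structure N_i, returns a point p of P with ‖pq‖ ≤ r_i if one exists, and None otherwise.

    def interval_ann_query(q, radii, nnbr_query):
        """Answer a query on the structure I(P, a, b, eps).
        radii:      the list [r_0, ..., r_M] with r_0 = a and r_M = b.
        nnbr_query: nnbr_query(i, q) returns a point within distance radii[i]
                    of q, or None if no such point of P exists.
        Returns ('below', p), ('above', None), or ('ann', p)."""
        M = len(radii) - 1
        p0 = nnbr_query(0, q)
        if p0 is not None:
            return ('below', p0)                 # d_P(q) <= a
        if nnbr_query(M, q) is None:
            return ('above', None)               # d_P(q) > b
        lo, hi = 0, M                            # invariant: X_lo = 0, X_hi = 1
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if nnbr_query(mid, q) is None:
                lo = mid
            else:
                hi = mid
        return ('ann', nnbr_query(hi, q))        # r_lo < d_P(q) <= r_hi <= (1+eps) r_lo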
Corollary 12.2.7 Let P be a set of n points in M, and let a < b be real numbers. For a query point q ∈ M such that d_P(q) ∈ [a, b], the target query over the set of balls of Î(P, a, b, ε) returns a ball centered at a (1 + ε)-ANN to q.
Lemma 12.2.6 implies that we can “cheaply” resolve (1 + ε)-ANN over intervals which are not too long.
Definition 12.2.8 For a set P of n points in a metric space M, let U_balls(P, r) = ∪_{p∈P} b(p, r) denote the union of balls of radius r around the points of P.
Lemma 12.2.9 Let Q be a set of m points in M, and let r > 0 be a real number such that U_balls(Q, r) is a connected set. Then: (i) Any two points p, q ∈ Q are in distance ≤ 2r(m − 1) from each other. (ii) For a query point q ∈ M such that d_Q(q) > 2mr/δ, any point of Q is a (1 + δ)-ANN of q.
Proof: (i) Since U_balls(Q, r) is a connected set, there is a path of length ≤ 2r(m − 1) between any two points x, y of Q. Indeed, consider the graph G which connects two vertices u, v ∈ Q if d_M(u, v) ≤ 2r. Since U_balls(Q, r) is a connected set, it follows that G is connected. As such, there is a path with at most m − 1 edges between x and y in G. This corresponds to a path of length ≤ 2r(m − 1) connecting x to y in M, and this path lies inside U_balls(Q, r).
(ii) For any p ∈ Q, we have

d_M(q, p) ≤ d_M(q, q̂) + d_M(q̂, p) ≤ d_M(q, q̂) + 2mr ≤ (1 + δ) d_M(q, q̂),

where q̂ is the nearest neighbor of q in Q, since d_M(q̂, p) ≤ 2r(m − 1) ≤ 2mr by (i), and 2mr < δ · d_Q(q) = δ · d_M(q, q̂).
Lemma 12.2.9 implies that for faraway query points, a cluster of points Q which are close together can be treated as a single point.
Proof: We are going to build a tree D, such that each node v has an interval near-neighbor data-structure Î_v associated with it. As we traverse down the tree, we will use those data-structures to decide into which child to continue the search.
Compute the minimum value r > 0 such that U_balls(P, r) is made out of ⌈n/2⌉ connected components. We set Î_root(T) = Î(P, r, R, ε/4), where R = 2cμnr/ε, μ is a global parameter, and c > 1 is an appropriate constant, both to be determined shortly. For each connected component C of U_balls(P, r), we recursively build a tree for C ∩ P (i.e., the points corresponding to C), and hang it on root(T). Furthermore, from each such connected component C, we pick one representative point q ∈ C ∩ P, and let Q be this set of points. We also build (recursively) a tree for Q, and hang it on root(T). We will refer to the child of root(T) corresponding to Q as the outer child of root(T).
Given a query point q ∈ M, use Î_root(T) = Î(P, r, R, ε/4) to determine if d_P(q) ≤ r. If so, we continue the search recursively in the relevant child built for the connected component of U_balls(P, r) containing q (we know which connected component it is, because Î_root(T) also returns a point of P in distance ≤ r from q). If d_P(q) ∈ [r, R], then we find a (1 + ε)-ANN directly from Î_root(T). Otherwise, d_P(q) > R, and we continue the search recursively in the outer child of root(T).
Observe that, in any case, we continue the search on a point set of size ≤ n/2 + 1. As such, after at most log_{3/2} n steps the search halts.
Correctness. If during the search the algorithm traverses from a node v down to one of the connected components of U_balls(P_v, r_v), then no error is introduced, where P_v is the point set used in constructing v, and r_v is the value of r used in its construction. If the query is resolved by Î_v, then a (1 + ε/4) error is introduced into the quality of approximation. If the algorithm continues the search in the outer child of v, then an error of 1 + δ_v is introduced into the answer, where δ_v = ε/(cμ), by Lemma 12.2.9 (ii). Thus, the overall quality of the ANN returned in the worst case is

t ≤ (1 + ε/4) ∏_{i=1}^{log_{3/2} n} (1 + ε/(cμ)) ≤ exp(ε/4) ∏_{i=1}^{log_{3/2} n} exp(ε/(cμ)),

since 1 + x ≤ e^x for x ≤ 1. Thus, setting μ = ⌈log_{3/2} n⌉ and c to be a sufficiently large constant, we have t ≤ exp( ε/4 + ∑_{i=1}^{log_{3/2} n} ε/(cμ) ) ≤ exp(ε/2) ≤ 1 + ε, since ε < 1/2. We used the fact that e^x ≤ 1 + 2x for x ≤ 1/2, as can be easily verified.
Number of queries. As the search algorithm proceeds down the tree D, at most two NNbr queries are performed at each node. At the last node of the traversal, the algorithm performs O(log(ε^{−1} log(n/ε))) = O(log(n/ε)) queries, by Lemma 12.2.6.
Number of balls. We need a new interpretation of the construction algorithm. In particular, let H be the HST constructed for P using the exact distances. It is easy to observe that a connected component of U_balls(P, r) is represented by a node v and its subtree in H. In particular, the recursive construction for each connected component is essentially calling the algorithm recursively on a subtree of H. Let V be the set of nodes of H which represent connected components of U_balls(P, r).
Similarly, the outer recursive call can be charged to the upper tree of H, having the nodes of V for leaves. Indeed, the outer set of points is the set rep(V) = { rep_v | v ∈ V }. Let L̂ be this collection of subtrees of H. Clearly, those subtrees are not disjoint in their vertices, but they are disjoint in their edges. The total number of edges is O(n), and |L̂| ≥ n/2.
Namely, we can interpret the algorithm (somewhat counterintuitively) as working on H. At every stage, we break the current HST into subtrees, and recursively continue the construction on those connected subtrees.
In particular, for a node v ∈ D, let n_v be the number of children of v. Clearly, |P_v| = O(n_v), and since we can charge each such child to the disconnection of edges of H from each other, we have that ∑_{v∈D} n_v = O(n).
At a node v ∈ D, the data-structure Î_v requires storing m_v = O(ε^{−1} |P_v| log(R_v/r_v)) = O(ε^{−1} n_v log(μ n_v/ε)) balls. We conclude that the overall number of balls stored in D is

∑_{v∈D} O( (n_v/ε) log(μ n_v/ε) ) = O( (n/ε) log( (n log n)/ε ) ) = O( (n/ε) log(n/ε) ).
A single target query. The claim that a target query on B answers a (1 + ε)-ANN query on P follows by an inductive proof over the execution of the algorithm. Indeed, assume the algorithm is at a node v, and let B_v be the set of balls stored in the subtree of v. We claim that if the algorithm continues the search into a child w, then all the balls of U = B_v \ B_w are irrelevant for the target query.
Indeed, if w is the outer child of v, then all the balls in U are too small, and none of them contains q. Otherwise, q ∈ U_balls(P_v, r_v). As such, all the balls of B_v that are bigger than the balls of U_balls(P_v, r_v) cannot be the smallest ball containing q, and as such they can be ignored. Furthermore, all the other balls stored in the other children of v are of radius ≤ r_v, and are not in the same connected component as U_balls(P_w, r_v) in U_balls(P_v, r_v); as such, none of them is relevant for the target query.
The only other case is when the algorithm stops the search at v. But in this case, we have r_v ≤ d_P(q) ≤ R_v, and then all the balls in the children of v are either too big (i.e., the balls stored at the outer child are of radius > R_v) or too small (i.e., the balls stored at the regular children are of radius < r_v). Thus, only the balls of Î(P_v, r_v, R_v, ε/4) are relevant, and there we know that the returned ball is centered at a (1 + ε/4)-ANN, by Corollary 12.2.7.
Thus, the point returned by the target query on B is identical to the one returned by running the search algorithm on D, and as such, by the above proof, the result is correct.
Lemma 12.2.11 Given a set P of n points in M and an HST H of P that t-approximates M, one can construct a data-structure D that answers (1 + ε)-ANN queries by performing O(log(n/ε)) NNbr queries. The total number of balls stored at D is O(nε^{−1} log(tn/ε)). The construction time is O(nε^{−1} log(tn/ε)).
Proof: We reimplement the algorithm of Theorem 12.2.10, by doing the decomposition directly on the HST H. Indeed, let U = { ∆_v | v ∈ H, and v is an internal node }. Let ℓ be the median value of U. Let V be the set of all the nodes v of H such that ∆_v ≤ ℓ and ∆_{p(v)} > ℓ, where p(v) denotes the parent of v. Next, build an Î = Î(P, r, R, ε/4), where r and R are set as in Eq. (12.1), and where c is a large enough constant. As in the algorithm of Theorem 12.2.10, the set V breaks H into ⌈n/2⌉ + 1 subtrees, and we continue the construction on each such connected component. In particular, for the new root node we create, set r_root(D) = r, R_root(D) = R, and ℓ_root(D) = ℓ.
Observe that, for a query point q ∈ M, if d_P(q) ≤ r, then Î would return a ball, which in turn corresponds to a point stored in one of the subtrees rooted at a node v ∈ V. Let C be the connected component of U_balls(P, r) that contains q, and let Q = P ∩ C. It is easy to verify that Q ⊆ P_v. Indeed, since H t-approximates d_M, and Q is a connected component of U_balls(P, r), it follows that Q must be in the same connected component of H when considering distances ≤ t·r < ℓ. But such a connected component of H is nothing more than a node v ∈ H and the points of P stored in the subtree of v in H. However, such a node v is either in V, or one of its ancestors is in V.
For a node v ∈ D, the number of balls of Î_v is O(ε^{−1} n_v log((log n) n_v t/ε)). Thus, the overall number of balls in the data-structure is as claimed. As for the construction time, it is dominated by the size of the Î data-structures, and as such the same bound holds.
Using Corollary 12.1.4 with Lemma 12.2.11 we get the following.
Theorem 12.2.12 Let P be a set of n points in IRd, and ε > 0 a parameter. One can compute a set of O(nε^{−1} log(n/ε)) balls (the same bound holds for the running time), such that a (1 + ε)-ANN query can be resolved by a target ball query on this set of balls.
Alternatively, one can construct a data-structure where (1 + ε)-ANN queries can be resolved by O(log(n/ε)) NNbr queries.
Definition 12.3.1 For a ball b = b(p, r), a set b≈ is an (1 + ε)-approximation to b, if b ⊆ b≈ ⊆ b(p, (1 + ε)r).
For a set of balls B, the set B≈ is an (1 + ε)-approximation to B, if for any ball b ∈ B there is a corresponding (1 + ε)-
approximation b≈ ∈ B≈ . For a set b≈ ∈ B≈ , let b ∈ B denote the ball corresponding to b≈ , rb be the radius of b, and let pb ∈ P
denote the center of b.
For a query point q ∈ M, the target set of B≈ in q, is the set b≈ of B≈ that contains q and has the smallest radius rb .
Lemma 12.3.2 Let I≈ = I≈(P, r, R, ε/16) be a (1 + ε/16)-approximation to Î(P, r, R, ε/16). For a query point q ∈ M, if I≈ returns a ball centered at p ∈ P of radius α, with α ∈ [r, R], then p is a (1 + ε/4)-ANN to q.
Proof: The data-structure I≈ returns p as an ANN to q if and only if there are two consecutive indices i, i + 1 such that q is inside the union of the approximate balls of N_{i+1} but not inside the approximate balls of N_i. Thus, r(1 + ε/16)^i ≤ d_P(q) ≤ d(p, q) ≤ r(1 + ε/16)^{i+1}(1 + ε/16) ≤ (1 + ε/4) d_P(q). Thus, p is indeed a (1 + ε/4)-ANN.
Lemma 12.3.3 Let P be a set of n points in a metric space M, and let B be a set of balls with centers at P, computed by the
algorithm of Lemma 12.2.11, such that one can answer (1 + ε/16)-ANN queries on P, by performing a single target query in B.
Let B≈ be a (1 + ε/16)-approximation to B. A target query on B≈ , for a query point q, returns a (1 + ε)-ANN to q in P.
Proof: Let D be the tree computed by Lemma 12.2.11. We prove the correctness of the algorithm by an inductive proof over
the height of D, similar in nature to the proofs of Lemma 12.2.11 and Theorem 12.2.10.
Indeed, for a node v ∈ D, let I≈_v = I≈(P_v, r_v, R_v, ε/16) be the set of (1 + ε/16)-approximate balls of B≈ that corresponds to the set of balls stored in Î_v = Î(P_v, r_v, R_v, ε/16). If the algorithm stops at v, we know by Lemma 12.3.2 that we returned a (1 + ε/4)-ANN to q in P_v.
Otherwise, if we continued the search into the outer child of v using Î_v, then we would also continue the search into this node when using I≈_v. As such, by induction, the result is correct.
Otherwise, we continue the search into a node w, where q ∈ U_balls(P_w, (1 + ε/16) r_v). We observe that, because of the factor 2 slackness of Eq. (12.1), we are guaranteed to continue the search in the right connected component of U_balls(P_v, ℓ_v).
Thus, the quality of approximation argument of Lemma 12.2.11 and Theorem 12.2.10 still holds, and requires only an easy adaptation to this case (we omit the straightforward but tedious details). We conclude that the returned result is correct.
Open problems. The argument showing that one can use an approximate near-neighbor data-structure instead of an exact one (i.e., Lemma 12.3.3) is tedious and far from being elegant; see Exercise 12.5.2. A natural question for further research is to try and give a simple concrete condition on the set of balls, such that using an approximate near-neighbor data-structure still gives the correct result. Currently, it seems that what we need is some kind of separability property. However, it would be nice to give a simpler direct condition.
As mentioned above, Sabharwal et al. [SSS02] showed a reduction from ANN to a linear number of balls (ignoring the dependency on ε). However, it seems like a simpler construction should work.
12.5 Exercises
Exercise 12.5.1 (Lower bound on computing HST) [10 Points]
Show, by an adversarial argument, that for any t > 1 the following holds: any algorithm computing an HST H for n points in a metric space M, such that H t-approximates d_M, must in the worst case inspect all n(n − 1)/2 distances in the metric. Thus, computing an HST requires quadratic time in the worst case.
Conjecture 12.5.3 Let P be a set of n points in the plane, and let B be a set of balls such that one can answer (1 + ε/c)-ANN queries on P by performing a single target query in B. Now, let B≈ be a (1 + ε/c)-approximation to B. Then a target query on B≈ answers a (1 + ε)-ANN query on P, for c a large enough constant.
Give a counterexample to the above conjecture, showing that it is incorrect.
Chapter 13
“She had given him a smile, first because that was more or less what she was there for, and then because she had never seen him before and she had a prejudice in favor of people she did not know.”
– The Roots of Heaven, Romain Gary
13.1 Introduction
A Voronoi diagram of a point set P ⊆ IRd is a partition of space into regions, such that the cell of p ∈ P is V(p, P) = { x | ‖xp‖ ≤ ‖xp′‖ for all p′ ∈ P }. Voronoi diagrams are a powerful tool and have numerous applications [Aur91].
One problem with Voronoi diagrams is that their descriptive complexity is O(n^⌈d/2⌉) in IRd, in the worst case. See Figure 13.1 for an example in three dimensions. It is a natural question to ask whether one can reduce the complexity to linear (or near linear) by allowing some approximation.
Definition 13.1.1 (Approximate Voronoi Diagram.) Given a set P of n points in IRd and a parameter ε > 0, a (1 + ε)-approximate Voronoi diagram of P is a partition V of space into regions, such that for any region ϕ ∈ V there is an associated point rep_ϕ ∈ P, such that for any x ∈ ϕ, the point rep_ϕ is a (1 + ε)-ANN for x in P. We will refer to V as a (1 + ε)-AVD.
Figure 13.1: (a) The point set in 3D inducing a Voronoi diagram of quadratic complexity. (b) Some cells in this Voronoi diagram. Note that the cells are thin and flat, and every cell from the lower part touches the cells on the upper part. (c) The contact surface between the two parts of the Voronoi diagram has quadratic complexity, and thus the Voronoi diagram itself has quadratic complexity.
Theorem 13.2.1 Let P be a set of n points in IRd. One can build a compressed quadtree T̂, in O(nε^{−d} log²(n/ε)) time, of size O(nε^{−d} log(n/ε)), such that a (1 + ε)-ANN query on P can be answered by a single point-location query in T̂. Such a point-location query takes O(log(n/ε)) time.
Proof: The construction is described above. We only need to prove the bounds on the running time. It takes O(nε^{−1} log(n/ε)) time to compute B. For every such ball, we generate O(1/ε^d) canonical grid cells that cover it. Let C be the resulting set of grid cells (after filtering out multiple instances of the same grid cell). Naively, the size of C is bounded by O(|B|/ε^d). However, it is easy to verify that B has a large number of balls of similar size centered at the same point (because the set Î has a lot of such balls). In particular, for a set of balls Î({p}, r, 2r, ε/16), only O(1/ε^d) canonical grid cells are required to approximate it. Thus, we can bound |C| by N = O(|B|/ε^{d−1}) = O(nε^{−d} log(n/ε)). We can also compute C in this time, by careful implementation (we omit the straightforward and tedious details).
Constructing T̂ for C takes O(N log N) = O(nε^{−d} log²(n/ε)) time (Lemma 2.2.5). The resulting compressed quadtree is of size O(N). Point-location queries in T̂ take O(log N) = O(log(n/ε)) time. Given a query point and the leaf v such that q ∈ □_v, we need to find the first ancestor above v that has a point associated with it. This can be done in constant time, by preprocessing T̂ and propagating this information down the compressed quadtree.
Definition 13.3.1 (Exponential Grid.) For a point p ∈ IRd , and parameters r, R and ε > 0, let GE (p, r, R, ε) denote an exponential
grid centered at p.
Figure 13.2: Exponential grid.
Let b_i = b(p, r_i), for i = 0, ..., ⌈lg(R/r)⌉, where r_i = r·2^i. Next, let G′_i be the set of cells of the canonical grid G_{α_i} that intersect b_i, where α_i = 2^{⌊lg(εr_i/(16d))⌋}. Clearly, |G′_i| = O(1/ε^d). We remove from G′_i all the canonical cells completely covered by cells of G′_{i−1}. Similarly, for the cells of G′_i that are partially covered by cells of G′_{i−1}, we replace them by the cells covering them in G_{α_{i−1}}. Let G_i be the resulting set of canonical grid cells, and let G_E(p, r, R, ε) = ∪_i G_i. We have |G_E(p, r, R, ε)| = O(ε^{-d} log(R/r)). Furthermore, it can be computed in linear time in its size; see Figure 13.2.
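To make the construction above concrete, here is a small Python sketch of generating the canonical cells of an exponential grid around a point p. It is our own simplification: the function names (exp_grid, canonical_cells) are ours, the filtering of cells covered by the previous level is omitted, and duplicates are only removed within a level, so the sketch illustrates the O(ε^{-d} log(R/r)) cell count rather than the exact construction.

```python
import math
import numpy as np

def canonical_cells(center, radius, alpha):
    """Canonical grid cells of sidelength alpha (aligned to multiples of alpha)
    whose bounding box intersects the ball b(center, radius)."""
    lo = np.floor((center - radius) / alpha).astype(int)
    hi = np.floor((center + radius) / alpha).astype(int)
    ranges = [range(int(l), int(h) + 1) for l, h in zip(lo, hi)]
    grid = np.meshgrid(*ranges, indexing="ij")
    return {(alpha,) + idx for idx in zip(*(g.ravel() for g in grid))}

def exp_grid(p, r, R, eps, d):
    """Sketch of G_E(p, r, R, eps): grids of exponentially growing sidelength
    covering the balls b(p, r_i), where r_i = r * 2**i."""
    cells = set()
    for i in range(int(math.ceil(math.log2(R / r))) + 1):
        r_i = r * 2 ** i
        alpha_i = 2.0 ** math.floor(math.log2(eps * r_i / (16 * d)))
        cells |= canonical_cells(np.asarray(p, dtype=float), r_i, alpha_i)
    return cells

# Tiny usage example (d = 2): the cell count behaves like eps^{-d} log(R/r).
print(len(exp_grid((0.5, 0.5), r=0.01, R=0.16, eps=0.5, d=2)))
```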
Let P be a set of n points, and let 0 < ε < 1/2. As before, we assume that P ⊆ [0.5 − ε/d, 0.5 + ε/d]d .
Compute a (1/(8d))^{-1}-WSPD W of P. Note that |W| = O(n). For every pair X = {u, v} ∈ W, let ℓ_uv = ‖uv‖, and consider the set of canonical cells
W(u, v) = G_E(rep_u, ℓ_uv/4, 4ℓ_uv/ε, ε) ∪ G_E(rep_v, ℓ_uv/4, 4ℓ_uv/ε, ε).
For a query point q, if d_P(q) ∈ [ℓ_uv/4, ℓ_uv/ε] and the nearest neighbor of q is in P_u ∪ P_v, then the cell □ ∈ W(u, v) containing q is of the right size; it complies with Eq. (13.1), and as such any (1 + ε/4)-ANN to any point of □ is a (1 + ε)-ANN to q.
Thus, let W = ∪_{{u,v}∈W} W(u, v) ∪ [0, 1]^d. Let T̂ be a compressed quadtree constructed so that it contains all the canonical cells of W as nodes (we remind the reader that this can be done in O(|W| log |W|) time; see Lemma 2.2.5). Next, let U be the space decomposition induced by the leaves of T̂.
For each cell □ ∈ U, take an arbitrary point rep_□ inside it, and compute a (1 + ε/4)-ANN to rep_□. Let rep̃_□ ∈ P be this ANN, and store it together with □.
Proof: Let q be an arbitrary query point, and let □ ∈ U be the cell containing q. Let rep_□ ∈ □ be its representative, and let rep̃_□ ∈ P be the (1 + ε/4)-ANN to rep_□ stored at □. Also, let q̂ be the nearest neighbor of q in P. If rep̃_□ = q̂ then we are done. Otherwise, consider the pair {u, v} ∈ W such that rep̃_□ ∈ P_u and q̂ ∈ P_v. Finally, let ℓ = ‖rep̃_□ q̂‖.
If ‖q q̂‖ > ℓ/ε then rep̃_□ is a (1 + ε)-ANN, since
‖q rep̃_□‖ ≤ ‖q q̂‖ + ‖q̂ rep̃_□‖ ≤ (1 + ε)‖q q̂‖.
If ‖q q̂‖ < ℓ/4, then by the construction of W(u, v), we have that diam(□) ≤ (ε/16)‖rep_u rep_v‖ ≤ εℓ/8. This holds by the construction of the WSPD, where rep_u, rep_v are the representatives, from the WSPD construction, of P_u and P_v, respectively. See Figure 13.3. But then,
‖rep_□ q̂‖ ≤ ‖q q̂‖ + diam(□) ≤ ℓ/4 + εℓ/8 ≤ (5/16)ℓ,
for ε ≤ 1/2. On the other hand,
‖rep_□ rep̃_□‖ ≥ ‖rep_u rep_v‖ − ‖rep_v q̂‖ − ‖q̂ q‖ − diam(□) − ‖rep_u rep̃_□‖ ≥ ℓ − ℓ/8 − ℓ/4 − εℓ/8 − ℓ/8 ≥ (7/16)ℓ.
But then, (1 + ε/4)‖rep_□ q̂‖ ≤ (9/8)(5/16)ℓ < (7/16)ℓ ≤ ‖rep_□ rep̃_□‖. A contradiction, since rep̃_□ is supposed to be a (1 + ε/4)-ANN to rep_□ in P. See Figure 13.3.
Figure 13.3.
If ‖q q̂‖ ∈ [ℓ/4, ℓ/ε], then by the construction of W(u, v), we have that diam(□) ≤ (ε/4) d_{P_u ∪ P_v}(q). We have d_P(rep_□) ≤ (1 + ε/4) d_P(q) and
‖q rep̃_□‖ ≤ ‖q rep_□‖ + ‖rep_□ rep̃_□‖ ≤ diam(□) + (1 + ε/4) d_P(rep_□)
≤ (ε/4) d_P(q) + (1 + ε/4)(1 + ε/4) d_P(q) ≤ (1 + ε) d_P(q),
as required.
Theorem 13.3.3 Given a set P of n points in IR^d, one can compute, in O(nε^{-d} log(1/ε)(ε^{-d} + log(n/ε))) time, a (1 + ε)-AVD of P. The AVD is of complexity O(nε^{-d} log(1/ε)).
Proof: The total number of cubes in W is O(nε^{-d} log(1/ε)), and W can also be computed in this time, as described above. For each node of W we need to perform a (1 + ε/4)-ANN query on P. After O(n log n) preprocessing, such queries can be answered in O(log n + 1/ε^d) time (Theorem 11.3.3). Thus, we can answer those queries, and build the overall data-structure, in the time stated in the theorem.
Chapter 14
In this chapter, we will prove that given a set P of n points in IR^d, one can reduce the dimension of the points to k = O(ε^{-2} log n) so that distances are preserved up to a factor of 1 ± ε. Surprisingly, this reduction is done by randomly picking a subspace of k dimensions and projecting the points into this random subspace. One way of thinking about this result is that we are “compressing” the input of size nd (i.e., n points with d coordinates) into size O(nε^{-2} log n), while (approximately) preserving distances.
Theorem 14.1.2 (Brunn-Minkowski inequality) Let A and B be two non-empty compact sets in IRn . Then
Vol(A + B)1/n ≥ Vol(A)1/n + Vol(B)1/n .
Definition 14.1.3 A set A ⊆ IR^n is a brick set if it is the union of finitely many (closed) axis-parallel boxes with disjoint interiors.
It is intuitively clear, by limit arguments, that proving Theorem 14.1.2 for brick sets implies it for the general case.
Lemma 14.1.4 (Brunn-Minkowski inequality for Brick Sets) Let A and B be two non-empty brick sets in IRn . Then
Vol(A + B)1/n ≥ Vol(A)1/n + Vol(B)1/n
Proof: By induction on the number k of bricks in A and B. If k = 2 then A and B are just bricks, with dimensions a_1, ..., a_n and b_1, ..., b_n, respectively. In this case, the dimensions of A + B are a_1 + b_1, ..., a_n + b_n, as can be easily verified. Thus, we need to prove that (∏_{i=1}^n a_i)^{1/n} + (∏_{i=1}^n b_i)^{1/n} ≤ (∏_{i=1}^n (a_i + b_i))^{1/n}. Dividing the left side by the right side, we have
∏_{i=1}^n ( a_i/(a_i + b_i) )^{1/n} + ∏_{i=1}^n ( b_i/(a_i + b_i) )^{1/n} ≤ (1/n) Σ_{i=1}^n a_i/(a_i + b_i) + (1/n) Σ_{i=1}^n b_i/(a_i + b_i) = 1,
by the generalized arithmetic-geometric mean inequality¯, and the claim follows for this case.
¯ Here is a proof of this generalized form: let x_1, ..., x_n be n positive real numbers, and consider the quantity R = x_1 x_2 ⋯ x_n. If we fix the sum of the n numbers to be equal to α, then R is maximized when all the x_i are equal. Thus, (x_1 x_2 ⋯ x_n)^{1/n} ≤ ((α/n)^n)^{1/n} = α/n = (x_1 + ⋯ + x_n)/n.
Now let k > 2 and suppose that the Brunn-Minkowski inequality holds for any pair of brick sets with fewer than k bricks
(together). Let A, B be a pair of brick sets having together k bricks, where A has at least two (disjoint) bricks. This implies that
there is an axis parallel hyperplane h that separates between the interior of one brick of A and the interior of another brick of A (the
hyperplane h might intersect other bricks of A). Assume that h is the hyperplane x1 = 0 (this can be achieved by translation and
renaming of coordinates).
Let A^+ = A ∩ h^+ and A^− = A ∩ h^−, where h^+ and h^− are the two open halfspaces induced by h. Let A^+ and A^− be the closures of A^+ and A^−, respectively. Clearly, A^+ and A^− are both brick sets with (at least) one fewer brick than A.
Next, observe that the claim is translation invariant, and as such, let us translate B so that its volume is split by h in the same
ratio A’s volume is being split. Denote the two parts of B by B+ and B− , respectively. Let ρ = Vol(A+ )/ Vol(A) = Vol(B+ )/ Vol(B)
(if Vol(A) = 0 or Vol(B) = 0 the claim trivially holds).
Observe, that A+ + B+ ⊆ A + B, and it lies on one side of h, and similarly A− + B− ⊆ A + B and it lies on the other side of h.
Thus, by induction, we have
Vol(A + B) ≥ Vol(A^+ + B^+) + Vol(A^− + B^−)
≥ (Vol(A^+)^{1/n} + Vol(B^+)^{1/n})^n + (Vol(A^−)^{1/n} + Vol(B^−)^{1/n})^n
= (ρ^{1/n} Vol(A)^{1/n} + ρ^{1/n} Vol(B)^{1/n})^n + ((1 − ρ)^{1/n} Vol(A)^{1/n} + (1 − ρ)^{1/n} Vol(B)^{1/n})^n
= (ρ + (1 − ρ)) (Vol(A)^{1/n} + Vol(B)^{1/n})^n
= (Vol(A)^{1/n} + Vol(B)^{1/n})^n.
Theorem 14.1.5 (Brunn-Minkowski for slice volumes.) Let P be a convex set in IR^{n+1}, and let A = P ∩ (x_1 = a), B = P ∩ (x_1 = b) and C = P ∩ (x_1 = c) be three slices of P, for a < b < c. We have Vol(B) ≥ min(Vol(A), Vol(C)).
In fact, consider the function
v(t) = (Vol(P ∩ (x1 = t)))1/n ,
and let I = [tmin , tmax ] be the interval where the hyperplane x1 = t intersects P. Then, v(t) is concave in I.
Proof: If a or c are outside I, then Vol(A) = 0 or Vol(C) = 0, respectively, and then the claim trivially holds.
Otherwise, let α = (b − a)/(c − a). We have that b = (1 − α) · a + α · c, and by the convexity of P, we have (1 − α)A + αC ⊆ B.
Thus, by Theorem 14.1.2 we have
v(b) = Vol(B)1/n ≥ Vol((1 − α)A + αC)1/n ≥ Vol((1 − α)A)1/n + Vol(αC)1/n
= (1 − α) · Vol(A)1/n + α · Vol(C)1/n
≥ (1 − α)v(a) + αv(c).
Namely, v(·) is concave on I, and in particular v(b) ≥ min(v(a), v(c)), which in turn implies that Vol(B) = v(b)^n ≥ min(Vol(A), Vol(C)), as claimed.
Corollary 14.1.6 For A and B compact sets in IR^n, we have Vol((A + B)/2) ≥ √(Vol(A) Vol(B)).
Proof: Vol((A + B)/2)^{1/n} = Vol(A/2 + B/2)^{1/n} ≥ Vol(A/2)^{1/n} + Vol(B/2)^{1/n} = (Vol(A)^{1/n} + Vol(B)^{1/n})/2 ≥ √(Vol(A)^{1/n} Vol(B)^{1/n}), by Theorem 14.1.2, and since (a + b)/2 ≥ √(ab) for any a, b ≥ 0. The claim now follows by raising this inequality to the power n.
14.1.1 The Isoperimetric Inequality
The following is not used anywhere else and is provided only because of its mathematical elegance; the reader may safely skip this section.
The isoperimetric inequality states that among all convex bodies of a fixed surface area, the ball has the largest volume (in
particular, the unit circle is the largest area planar region with perimeter 2π). This problem can be traced back to antiquity, in
particular Zenodorus (200–140 BC) wrote a monograph (which was lost) that seemed to have proved the claim in the plane for
some special cases. The first formal proof for the planar case was done by Steiner in 1841. Interestingly, the more general claim is
an easy consequence of the Brunn-Minkowski inequality.
Let K be a convex body in IRn and b = bn be the n dimensional ball of radius one centered at the origin. Let S(X) denote the
surface area of a compact set X ⊆ IRn . The isoperimetric inequality states that
(Vol(K)/Vol(b))^{1/n} ≤ (S(K)/S(b))^{1/(n−1)}.     (14.1)
Namely, the left side is the radius of a ball having the same volume as K, and the right side is
the radius of a sphere having the same surface area as K. In particular, if we scale K so that its
surface area is the same as b, then the above inequality implies that Vol(K) ≤ Vol(b).
To prove Eq. (14.1), observe that Vol(b) = S(b) /n . Also, observe that K + ε b is the body
K together with a small “atmosphere” around it of thickness ε. In particular, the volume of this
“atmosphere” is (roughly) ε S(K) (in fact, Minkowski defined the surface area of a convex body
to be the limit stated next). Formally, we have
S(K) = lim_{ε→0^+} (Vol(K + εb) − Vol(K))/ε ≥ lim_{ε→0^+} ((Vol(K)^{1/n} + Vol(εb)^{1/n})^n − Vol(K))/ε,
The volume of a ball and the surface area of a hypersphere. In fact, let Vol(rb^n) denote the volume of the ball of radius r in IR^n, and let Area(rS^{(n−1)}) denote the surface area of its boundary (i.e., the surface area of the sphere rS^{(n−1)}). Slicing the unit ball by hyperplanes orthogonal to the x_n-axis, it is known that
Vol(b^n) = Vol(b^{n−1}) ∫_{−1}^{1} (1 − x_n²)^{(n−1)/2} dx_n.
Now, the integral on the right side tends to zero as n increases. In fact, for n very large, the term (1 − x_n²)^{(n−1)/2} is very close to 0 everywhere except for a small interval around 0. This implies that the main contribution to the volume of the ball happens when we consider slices of the ball by hyperplanes of the form x_n = δ, where δ is small.
If one has to visualize how such a ball in high dimensions looks, it might be best to think about it as a star-like creature: it has very little mass close to the tips of any set of orthogonal directions we pick, and most of its mass somehow lies close to its center.®
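The following short Python experiment (a sketch we added; the sampling scheme, a Gaussian direction scaled by a U^{1/n} radius, is the standard way to draw uniform points from b^n) illustrates this picture numerically: even a thin slab around the hyperplane x_n = 0 captures most of the volume of the ball once n is large.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_in_ball(n, samples):
    """Uniform points in the unit ball b^n: uniform direction times U^{1/n} radius."""
    g = rng.standard_normal((samples, n))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.random(samples) ** (1.0 / n)
    return directions * radii[:, None]

for n in (2, 10, 100, 1000):
    pts = uniform_in_ball(n, 20000)
    # Fraction of the volume within distance 2/sqrt(n) of the hyperplane x_n = 0.
    frac = np.mean(np.abs(pts[:, -1]) <= 2.0 / np.sqrt(n))
    print(n, round(float(frac), 3))
```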
Proof: We will prove a slightly weaker bound, with −nt²/4 in the exponent. Let
Â = { αx | x ∈ A, α ∈ [0, 1] } ⊆ b^n,
where b^n is the unit ball in IR^n. We have that Pr[A] = μ(Â), where μ(Â) = Vol(Â)/Vol(b^n).¯
Let B = S^{(n−1)} \ A_t. We have that ‖a − b‖ ≥ t for all a ∈ A and b ∈ B.
Lemma 14.2.2 For any â ∈ Â and b̂ ∈ B̂, we have ‖(â + b̂)/2‖ ≤ 1 − t²/8.
Proof: Let â = αa and b̂ = βb, where a ∈ A and b ∈ B. We have
‖(a + b)/2‖ = √(1² − ‖(a − b)/2‖²) ≤ √(1 − t²/4) ≤ 1 − t²/8,     (14.2)
since ‖a − b‖ ≥ t. As for â and b̂, assume (without loss of generality) that α ≤ β; the quantity ‖(â + b̂)/2‖ is largest when β = 1. Then
τ = ‖(â + b̂)/2‖ ≤ ‖(αa + b)/2‖ = ‖α·(a + b)/2 + (1 − α)·(b/2)‖ ≤ α‖(a + b)/2‖ + (1 − α)/2 ≤ α(1 − t²/8) + (1 − α)·(1/2),
by Eq. (14.2) and since ‖b‖ = 1. Now, τ is a convex combination of the two numbers 1/2 and 1 − t²/8. In particular, we conclude that τ ≤ max(1/2, 1 − t²/8) ≤ 1 − t²/8, since t ≤ 2.
®
In short, it looks like a Boojum [Car76].
¯ This is one of these “trivial” claims that might give the reader pause, so here is a formal proof. Pick a random point p uniformly inside the ball b^n. Let ψ be the probability that p ∈ Â. Clearly, Vol(Â) = ψ Vol(b^n). So, consider the normalized point q = p/‖p‖. Clearly, p ∈ Â if and only if q ∈ A, by the definition of Â. Thus, μ(Â) = Vol(Â)/Vol(b^n) = ψ = Pr[p ∈ Â] = Pr[q ∈ A] = Pr[A], since q has a uniform distribution on the hypersphere, by the symmetry of b^n.
By Lemma 14.2.2, the set (Â + B̂)/2 is contained in a ball of radius 1 − t²/8 around the origin. Applying the Brunn-Minkowski inequality in the form of Corollary 14.1.6, we have
(1 − t²/8)^n ≥ μ((Â + B̂)/2) ≥ √(μ(Â) μ(B̂)) = √(Pr[A] Pr[B]) ≥ √(Pr[B]/2).
Lemma 14.3.1 We have Pr[f < med(f)] ≤ 1/2 and Pr[f > med(f)] ≤ 1/2.
Proof: Since ∪_{k≥1} (−∞, med(f) − 1/k] = (−∞, med(f)), we have
Pr[f < med(f)] = sup_{k≥1} Pr[f ≤ med(f) − 1/k] ≤ 1/2.
Definition 14.3.2 (c-Lipschitz) A function f : A → B is c-Lipschitz if, for any x, y ∈ A, we have k f (x) − f (y)k ≤ c kx − yk.
Theorem 14.3.3 (Lévy’s Lemma.) Let f : S(n−1) → IR be 1-Lipschitz. Then for all t ∈ [0, 1],
Pr[f > med(f) + t] ≤ 2 exp(−t²n/2)   and   Pr[f < med(f) − t] ≤ 2 exp(−t²n/2).
Proof: We prove only the first inequality; the second follows by symmetry. Let
A = { x ∈ S^{(n−1)} | f(x) ≤ med(f) }.

Lemma 14.4.1 For a unit vector x ∈ S^{(n−1)}, let f(x) = ‖(x_1, ..., x_k)‖ be the length of the projection of x into the subspace formed by the first k coordinates. Let x be a vector randomly chosen with uniform distribution from S^{(n−1)}. Then f(x) is sharply concentrated. Namely, there exists m = m(n, k) such that
Pr[f(x) ≥ m + t] ≤ 2 exp(−t²n/2)   and   Pr[f(x) ≤ m − t] ≤ 2 exp(−t²n/2).
Furthermore, for k ≥ 10 ln n, we have m ≥ (1/2)√(k/n).
Proof: The orthogonal projection p : ℓ_2^n → ℓ_2^k given by p(x_1, ..., x_n) = (x_1, ..., x_k) is 1-Lipschitz (since projections can only shrink distances; see Exercise 14.7.4). As such, f(x) = ‖p(x)‖ is 1-Lipschitz, since for any x, y we have
|f(x) − f(y)| = |‖p(x)‖ − ‖p(y)‖| ≤ ‖p(x) − p(y)‖ ≤ ‖x − y‖,
by the triangle inequality and since p is 1-Lipschitz. Theorem 14.3.3 (i.e., Lévy’s lemma) gives the required tail estimate with m = med(f).
Thus, we only need to prove the lower bound on m. For a random x = (x_1, ..., x_n) ∈ S^{(n−1)}, we have E[‖x‖²] = 1. By linearity of expectation, and symmetry, we have 1 = E[‖x‖²] = E[Σ_{i=1}^n x_i²] = Σ_{i=1}^n E[x_i²] = n E[x_j²], for any 1 ≤ j ≤ n. Thus, E[x_j²] = 1/n, for j = 1, ..., n. Thus, E[(f(x))²] = k/n. We next use the fact that f is concentrated to show that f² is also relatively concentrated. For any t ≥ 0, we have
k/n = E[f²] ≤ Pr[f ≤ m + t]·(m + t)² + Pr[f ≥ m + t]·1 ≤ 1·(m + t)² + 2 exp(−t²n/2),
since f(x) ≤ 1 for any x ∈ S^{(n−1)}. Let t = √(k/(5n)). Since k ≥ 10 ln n, we have that 2 exp(−t²n/2) ≤ 2/n. We get that
k/n ≤ (m + √(k/(5n)))² + 2/n.
This implies that √((k − 2)/n) ≤ m + √(k/(5n)), which in turn implies that m ≥ √((k − 2)/n) − √(k/(5n)) ≥ (1/2)√(k/n).
At this point, we would like to flip Lemma 14.4.1 around, and instead of randomly picking a point and projecting it down to
the first k-dimensional space, we would like x to be fixed, and randomly pick the k-dimensional subspace. However, we need to
pick this k-dimensional space carefully, so that if we rotate this random subspace, by a transformation T , so that it occupies the first
k dimensions, then the point T (x) is uniformly distributed on the hypersphere.
To this end, we would like to pick a random rotation of IR^n; that is, an orthonormal matrix with determinant 1. We can generate such a matrix by randomly picking a vector e_1 ∈ S^{(n−1)}. Next, we set e_1 to be the first column of our rotation matrix, and generate the other n − 1 columns by generating, recursively, n − 1 orthonormal vectors in the space orthogonal to e_1.
Generating a random vector from the unit hypersphere, and a random rotation. At this point, the reader might
wonder how do we pick a point uniformly from the unit hypersphere. The idea is to pick a point from the multi-
dimensional normal distribution N d (0, 1), and normalizing it to have length 1. Since the multi-dimensional normal
distribution has the density function
(2π)^{−n/2} exp(−(x_1² + x_2² + ⋯ + x_n²)/2),
which is symmetric (i.e., all the points at distance r from the origin have the same density), it follows that this indeed generates a point uniformly at random on S^{(n−1)}.
Generating a vector with the multi-dimensional normal distribution amounts to picking each coordinate independently according to the (one-dimensional) normal distribution. Given a source of random numbers with the uniform distribution, this can be done using O(1) computations, via the Box-Muller transformation [BM58].
Since projecting the n-dimensional normal distribution down to a lower dimensional space yields a normal distribution, it follows that generating a random projection amounts to randomly picking n vectors v_1, ..., v_n according to the multi-dimensional normal distribution. Then, we orthonormalize them using Gram-Schmidt, where v̂_1 = v_1/‖v_1‖, and v̂_i is the normalized vector of v_i − w_i, where w_i is the projection of v_i to the space spanned by v_1, ..., v_{i−1}.
Taking those vectors as the columns of a matrix generates a matrix A with determinant either 1 or −1. We multiply one of the vectors by −1 if the determinant is −1. The resulting matrix is a random rotation matrix.
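Here is a small Python sketch (our own illustration) of the two procedures just described: a uniform point on S^{(n−1)} via a normalized Gaussian vector, and a random rotation obtained by Gram-Schmidt orthonormalization of Gaussian vectors, with a sign flip if the determinant comes out −1.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_unit_vector(n):
    """Uniform point on S^{(n-1)}: normalize an N^n(0,1) sample."""
    v = rng.standard_normal(n)
    return v / np.linalg.norm(v)

def random_rotation(n):
    """Random rotation matrix: Gram-Schmidt on n Gaussian vectors."""
    cols = []
    for _ in range(n):
        v = rng.standard_normal(n)
        for u in cols:                 # subtract projections onto previous columns
            v = v - np.dot(v, u) * u
        cols.append(v / np.linalg.norm(v))
    A = np.column_stack(cols)
    if np.linalg.det(A) < 0:           # fix orientation so that det(A) = +1
        A[:, 0] = -A[:, 0]
    return A

R = random_rotation(5)
print(np.allclose(R @ R.T, np.eye(5)), round(float(np.linalg.det(R)), 6))
```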
Definition 14.4.2 The mapping f : IR^n → IR^k is called K-bi-Lipschitz for a subset X ⊆ IR^n if there exists a constant c > 0 such that
cK^{−1} · ‖p − q‖ ≤ ‖f(p) − f(q)‖ ≤ c · ‖p − q‖,
for all p, q ∈ X.
The least K for which f is K-bi-Lipschitz is called the distortion of f , and is denoted dist( f ). We will refer to f as a K-
embedding of X.
Theorem 14.4.3 (Johnson-Lindenstrauss lemma.) Let X be an n-point set in a Euclidean space, and let ε ∈ (0, 1] be given. Then
there exists a (1 + ε)-embedding of X into IRk , where k = O(ε−2 log n).
Proof: Let X ⊆ IRn (if X lies in higher dimensions, we can consider it to be lying in the span of its points, if it is in lower
dimensions, we can add zero coordinates). Let k = 200ε−2 ln n. Assume k < n, and let F be a random k-dimensional linear
subspace of IRn . Let PF : IRn → F be the orthogonal projection operator of IRn into F. Let m be the number around which kPF (x)k
is concentrated, for x ∈ S(n−1) , as in Lemma 14.4.1.
Fix two points x, y ∈ IR^n; we prove that
(1 − ε/3) m ‖x − y‖ ≤ ‖P_F(x) − P_F(y)‖ ≤ (1 + ε/3) m ‖x − y‖
holds with probability ≥ 1 − n^{−2}. Since there are fewer than n² pairs of points in X, it follows that with constant probability this holds for all pairs of points of X. In such a case, the mapping P_F is a D-embedding of X into IR^k with D ≤ (1 + ε/3)/(1 − ε/3) ≤ 1 + ε, for ε ≤ 1.
Let u = x − y. We have P_F(u) = P_F(x) − P_F(y), since P_F(·) is a linear operator. Thus, the condition becomes (1 − ε/3) m ‖u‖ ≤ ‖P_F(u)‖ ≤ (1 + ε/3) m ‖u‖. Since this condition is scale independent, we can assume ‖u‖ = 1. Namely, we need to show that
|‖P_F(u)‖ − m| ≤ (ε/3) m.
By Lemma 14.4.1 (exchanging the roles of the random subspace and the random vector), for t = εm/3, we have that the probability that this does not hold is bounded by
4 exp(−t²n/2) = 4 exp(−ε²m²n/18) ≤ 4 exp(−ε²k/72) < n^{−2},
since m ≥ (1/2)√(k/n).
Lemma 14.5.2 The following properties hold for the d dimensional Gaussian distribution N d (0, 1):
(i) The distribution N d (0, 1) is centrally symmetric around the origin.
(ii) If X ∼ N d (0, 1) and u is a unit vector, then X · u ∼ N(0, 1).
(iii) If X, Y ∼ N(0, 1) are two independent variables, then Z = X² + Y² follows the exponential distribution with parameter λ = 1/2.
(iv) Given k independent variables X1 , . . . , Xk distributed according to the exponential distribution with parameter λ, then
Y = X1 + · · · + Xk is distributed according to the Gamma distribution Γλ,k (x).
Proof: (i) Let x = (x_1, ..., x_d) be a point picked from the Gaussian distribution. The density is φ_d(x) = φ(x_1)φ(x_2)⋯φ(x_d), where φ(x_i) is the normal density function φ(x_i) = exp(−x_i²/2)/√(2π). Thus φ_d(x) = (2π)^{−d/2} exp(−(x_1² + ⋯ + x_d²)/2). Consider any two points x, y ∈ IR^d such that r = ‖x‖ = ‖y‖. Clearly, φ_d(x) = φ_d(y). Namely, any two points at the same distance from the origin have the same density (i.e., “probability”). As such, the distribution N^d(0, 1) is centrally symmetric around the origin.
(ii) Consider e_1 = (1, 0, ..., 0) ∈ IR^d. Clearly, x · e_1 = x_1, which is distributed N(0, 1). Now, by the symmetry of N^d(0, 1), this implies that x · u is distributed N(0, 1). Formally, let R be a rotation matrix that maps u to e_1. We know that Rx is distributed N^d(0, 1) (since N^d(0, 1) is centrally symmetric). Thus x · u has the same distribution as Rx · Ru, which has the same distribution as x · e_1, which is N(0, 1).
(iii) Let X, Y ∼ N(0, 1), and consider the integral of the density function
A = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} (1/(2π)) exp(−(x² + y²)/2) dx dy.
We change the integration variables to x(r, α) = √r sin α and y(r, α) = √r cos α. The Jacobian of this change of variables is
I(r, α) = det [ ∂x/∂r  ∂x/∂α ; ∂y/∂r  ∂y/∂α ] = det [ (sin α)/(2√r)   √r cos α ; (cos α)/(2√r)   −√r sin α ] = −(1/2) sin²α − (1/2) cos²α = −1/2.
As such, we have
Pr[Z = z] = ∫_{x²+y²=z} (1/(2π)) exp(−(x² + y²)/2)
= ∫_{α=0}^{2π} (1/(2π)) exp(−(x(z, α)² + y(z, α)²)/2) · |I(z, α)| dα
= (1/(2π)) · (1/2) exp(−z/2) ∫_{α=0}^{2π} dα = (1/2) exp(−z/2).
As such, Z has an exponential distribution with λ = 1/2.
(iv) For k = 1 the claim is trivial. Otherwise, let g_{k−1}(x) = λ ((λx)^{k−2}/(k−2)!) exp(−λx). Observe that
g_k(t) = ∫_0^t g_{k−1}(t − x) g_1(x) dx = ∫_0^t ( λ ((λ(t − x))^{k−2}/(k−2)!) exp(−λ(t − x)) ) ( λ exp(−λx) ) dx
= λ² exp(−λt) ∫_0^t (λ(t − x))^{k−2}/(k−2)! dx
= λ exp(−λt) ∫_0^t λ (λx)^{k−2}/(k−2)! dx = λ exp(−λt) (λt)^{k−1}/(k−1)! = g_k(t).
Proof: By Lemma 14.5.2 (ii), each X_i is distributed as N(0, 1), and X_1, ..., X_k are independent. Define Y_i = X_{2i−1}² + X_{2i}², for i = 1, ..., τ, where τ = k/2. By Lemma 14.5.2 (iii), Y_i follows the exponential distribution with parameter λ = 1/2. Let L = Σ_{i=1}^τ Y_i. By Lemma 14.5.2 (iv), the variable L follows the Gamma distribution (k/2, 1/2), and its expectation is E[L] = Σ_{i=1}^{k/2} E[Y_i] = 2τ = k.
Now, let η = βτ = βk/2. We have
Pr[L ≥ βk] = 1 − Pr[L ≤ βk] = 1 − G_{1/2,τ}(βk) = e^{−η} Σ_{i=0}^{τ} η^i/i! ≤ (τ + 1) e^{−η} η^τ/τ!,
since η = βτ > τ, as β > 1. Now, τ! ≥ (τ/e)^τ, as can be easily verified°, and thus
Pr[L ≥ βk] ≤ (τ + 1) e^{−η} η^τ/(τ^τ/e^τ) = (τ + 1) e^{−η} (eη/τ)^τ = (τ + 1) e^{−βτ} (eβτ/τ)^τ
= (τ + 1) e^{−βτ} exp(τ ln(eβ)) = (τ + 1) exp(−τ(β − (1 + ln β)))
≤ ((k + 3)/2) exp(−(k/2)(β − (1 + ln β))).
since (eτ/(τβ))^τ ≥ 1/(2β)^ν. As the sequence (eτ/(iβ))^i is decreasing for i > τ/β, as can be easily verified±, we can bound the (decreasing) summation above by
Σ_{i=τ}^{ν} (eτ/(iβ))^i ≤ ν (e/β)^τ = ⌈2eτ⌉ exp(τ(1 − ln β)).
We conclude
Pr[L ≤ k/β] ≤ 2⌈2eτ⌉ exp(−τ/β + τ(1 − ln β)) ≤ 6k exp(−(k/2)(β^{−1} − (1 − ln β))).
° Indeed, ln n! = Σ_{i=1}^{n} ln i ≥ ∫_{x=1}^{n} ln x dx = [x ln x − x]_{x=1}^{x=n} = n ln n − n + 1 ≥ n ln n − n = ln((n/e)^n).
± Indeed, consider the function f(x) = x ln(c/x); its derivative is f′(x) = ln(c/x) − 1, and as such f′(x) = 0 for x = c/e. Namely, for c = eτ/β, the function f(x) achieves its maximum at x = τ/β, and from this point on the function is decreasing.
Next, we show how to interpret the inequalities of Lemma 14.5.3 in a somewhat more intuitive way. Let β = 1 + ε, for ε with 0 < ε < 1. From the Taylor expansion ln(1 + x) = Σ_{i=0}^{∞} ((−1)^i/(i + 1)) x^{i+1}, it follows that ln β ≤ ε − ε²/2 + ε³/3. By plugging this into the upper bound for Pr[L ≥ βk], we get
Pr[L ≥ βk] ≤ ((k + 3)/2) exp(−(k/2)(1 + ε − 1 − ε + ε²/2 − ε³/3)) ≤ ((k + 3)/2) exp(−(k/2)(ε²/2 − ε³/3)).
On the other hand, since ln β ≥ ε − ε²/2, we have Pr[L ≤ k/β] ≤ 6k exp(∆), where
∆ = −(k/2)(β^{−1} − (1 − ln β)) ≤ −(k/2)(1/(1 + ε) − 1 + ε − ε²/2)
  ≤ −(k/2)(ε²/(1 + ε) − ε²/2) = −(k/2) · (ε² − ε³)/(2(1 + ε)).
Thus, the probability that a given unit vector gets distorted by more than (1 + ε) in any direction² is roughly exp(−kε²/4), for small ε > 0. Therefore, if we are given a set P of n points in ℓ_2, we can set k to roughly 8 ln(n)/ε² and ensure that, with non-zero probability, we obtain a projection which does not distort the distance³ between any two different points of P by more than (1 + ε) in each direction.
Theorem 14.5.4 Let P be a set of n points in IR^d, 0 < ε, δ < 1/2, and k = 16 ln(n/δ)/ε². Let U_1, ..., U_k be random vectors chosen independently from the d-dimensional Gaussian distribution N^d(0, 1), and let T(x) = (U_1 · x, ..., U_k · x) be the resulting linear transformation. Then, with probability ≥ 1 − δ, for any p, q ∈ P, we have
(1/(1 + ε)) √k ‖p − q‖ ≤ ‖T(p) − T(q)‖ ≤ (1 + ε) √k ‖p − q‖.
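A minimal numerical sketch of the transformation of Theorem 14.5.4 (our own illustration; the point set is arbitrary, the constant 16 in k is taken from the theorem, and the check merely verifies the stated inequalities on one random pair):

```python
import numpy as np

rng = np.random.default_rng(2)

n, d, eps, delta = 200, 500, 0.5, 0.1
k = int(np.ceil(16 * np.log(n / delta) / eps ** 2))

P = rng.standard_normal((n, d))              # an arbitrary point set in IR^d
U = rng.standard_normal((k, d))              # rows are U_1, ..., U_k ~ N^d(0, 1)
T = P @ U.T                                  # T(x) = (U_1·x, ..., U_k·x), applied to all points

i, j = rng.choice(n, size=2, replace=False)
orig = np.linalg.norm(P[i] - P[j])
proj = np.linalg.norm(T[i] - T[j])
ratio = proj / (np.sqrt(k) * orig)
print(k, 1 / (1 + eps) <= ratio <= 1 + eps)  # holds for all pairs with probability >= 1 - delta
```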
Corollary 14.5.5 Let k be the target dimension of the transformation T of Theorem 14.5.4, and let β ≥ 3 be a parameter. We have that
‖T(p) − T(q)‖ ≤ β √k ‖p − q‖,
for any two points p, q ∈ P, and this holds with probability ≥ 1 − exp(−kβ²/(32 ln n)).
In fact, the random embedding preserves much more structure than just distances between points. It preserves the structure
and distances of surfaces as long as they are low dimensional and “well behaved”, see [AHY07] for some results in this direction.
Dimension reduction is crucial in learning, AI, databases, etc. One common technique used in practice is to perform PCA (i.e., principal component analysis) and take the first few main axes. Other techniques include independent component analysis and MDS (multidimensional scaling). MDS tries to embed points from high dimension into low dimension (d = 2 or 3), while preserving some properties. Theoretically, dimension reduction into really low dimensions is hopeless, as the distortion in the worst case is Ω(n^{1/(k−1)}), where k is the target dimension [Mat90].
14.7 Exercises
Exercise 14.7.1 (Boxes can be separated.) [1 Points]
(Easy.) Let A and B be two axis-parallel boxes that are interior disjoint. Prove that there is always an axis-parallel hyperplane
that separates the interior of the two boxes.
Corollary 14.7.3 For A and B compact sets in IR^n, we have, for any λ ∈ [0, 1], that Vol(λA + (1 − λ)B) ≥ Vol(A)^λ Vol(B)^{1−λ}.
Chapter 15
As we saw in previous lectures, to solve (1 + ε)-ANN it is enough to efficiently solve the approximate near neighbor problem. Namely, given a set P of n points in H^d, a radius r > 0 and a parameter ε > 0, we want to decide for a query point q whether d_H(q, P) ≤ r or d_H(q, P) ≥ (1 + ε)r.
Definition 15.1.2 For a set P of points, a data-structure NNbr≈ (P, r, (1 + ε)r) solves the approximate near neighbor problem, if
given a query point q, the data-structure works as follows.
• If d(q, P) ≤ r then NNbr≈ outputs a point p ∈ P such that d(p, q) ≤ (1 + ε)r.
• If d(q, P) ≥ (1 + ε)r, then NNbr≈ outputs that “d(q, P) ≥ r”.
• If r ≤ d(q, P) ≤ (1 + ε)r, either of the above answers is acceptable.
Given such a data-structure NNbr≈ (P, r, (1 + ε)r), one can construct a data-structure that answers ANN using O(log(n/ε))
queries.
Definition 15.1.3 Let U be a (small) positive integer. A family F = {h : S → [0, U]} of functions is (r, R, α, β)-sensitive if, for any u, q ∈ S, we have:
• If u ∈ b(q, r) then Pr[h(u) = h(q)] ≥ α.
• If u ∉ b(q, R) then Pr[h(u) = h(q)] ≤ β,
where h is picked randomly from F, r < R, and α > β.
Intuitively, if we can construct a (r, R, α, β)-sensitive family, then we can distinguish between two points which are close
together, and two points which are far away from each other. Of course, the probabilities α and β might be very close to each other,
and we need a way to do amplification.
Lemma 15.1.4 For the hypercube H^d = {0, 1}^d, let F be the set of functions
F = { h_i(b) = b_i | b = (b_1, ..., b_d) ∈ H^d, for i = 1, ..., d }.
Then, for any r and ε, the family F is (r, (1 + ε)r, 1 − r/d, 1 − r(1 + ε)/d)-sensitive.
Proof: If u, v ∈ {0, 1}d are in distance smaller than r from each other (under the Hamming distance), then they differ in at most
r coordinates. The probability that h ∈ F would project into a coordinate that u and v agree on is ≥ 1 − r/d.
Similarly, if dH (u, v) ≥ (1 + ε)r then the probability that h would map into a coordinate that u and v agree on is ≤ 1 − (1 + ε)r/d.
Intuitively, G is a family that extends F by probing into k coordinates instead of only one coordinate.
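The family G(F) itself is not spelled out in this excerpt, so here is a hedged Python sketch of the natural reading: a function g ∈ G(F) is the concatenation of k coordinate projections h_i drawn randomly from F.

```python
import random

def sample_g(d, k, rng=random):
    """A random g in G(F): probe k coordinates chosen (with repetition) from 0..d-1."""
    coords = [rng.randrange(d) for _ in range(k)]
    def g(point):                      # point is a 0/1 tuple of length d
        return tuple(point[c] for c in coords)
    return g

g = sample_g(d=8, k=3, rng=random.Random(0))
print(g((1, 0, 1, 1, 0, 0, 1, 0)))
```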
Lemma 15.1.5 If there is an (r, (1 + ε)r, α, β)-sensitive family F of functions for the hypercube, then there exists a NNbr≈(P, r, (1 + ε)r) which uses O(dn + n^{1+ρ}) space and O(n^ρ) hash probes for each query, where
ρ = ln(1/α) / ln(1/β).
This data-structure succeeds with constant probability.
Proof: It suffices to ensure that properties (1) and (2) hold with probability larger than half.
Set k = log_{1/β} n = (ln n)/ln(1/β). Then the probability that, for a random hash function g ∈ G(F), we have g(p′) = g(q) for a fixed p′ ∈ P \ b(q, (1 + ε)r), is at most
Pr[g(p′) = g(q)] ≤ β^k ≤ exp(ln(β) · (ln n)/ln(1/β)) = 1/n.
Thus, the expected number of elements of P \ b(q, (1 + ε)r) colliding with q in the jth hash table H_j is bounded by one. In particular, the overall expected number of such collisions in H_1, ..., H_τ is bounded by τ. By the Markov inequality, the probability that the number of collisions exceeds 4τ is less than 1/4; therefore the probability that property (2) holds is ≥ 3/4.
Next, for a point p ∈ b(q, r), consider the probability that g_j(p) = g_j(q), for a fixed j. Clearly, it is bounded from below by
α^k = α^{log_{1/β} n} = n^{−ln(1/α)/ln(1/β)} = n^{−ρ}.
Thus the probability that such a g_j exists is at least 1 − (1 − n^{−ρ})^τ. By setting τ = 2n^ρ, we get that property (1) holds with probability ≥ 1 − 1/e² > 4/5. The claim follows.
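Translating Lemma 15.1.5 into code, the data structure is just τ hash tables, each keyed by one g_j, and a query probes the tables and stops after inspecting O(τ) candidates. The following Python sketch is ours; the parameter choices k = log_{1/β} n and τ = 2n^ρ follow the proof, and the structure is specialized to the Hamming cube.

```python
import math, random
from collections import defaultdict

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def build_nnbr(points, alpha, beta, rng):
    """points: 0/1 tuples of equal length d."""
    n, d = len(points), len(points[0])
    k = max(1, int(round(math.log(n) / math.log(1.0 / beta))))
    rho = math.log(1.0 / alpha) / math.log(1.0 / beta)
    tau = max(1, int(round(2 * n ** rho)))
    tables = []
    for _ in range(tau):
        coords = [rng.randrange(d) for _ in range(k)]
        table = defaultdict(list)
        for p in points:
            table[tuple(p[c] for c in coords)].append(p)
        tables.append((coords, table))
    return tables

def query_nnbr(tables, q, r, eps):
    limit = 4 * len(tables)                         # inspect only O(tau) candidates
    seen = 0
    for coords, table in tables:
        for p in table.get(tuple(q[c] for c in coords), []):
            if hamming(p, q) <= (1 + eps) * r:
                return p                            # an approximate near neighbor
            seen += 1
            if seen >= limit:
                return None
    return None
```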
Claim 15.1.6 For x ∈ [0, 1) and t ≥ 1 such that 1 − tx > 0, we have ln(1 − x)/ln(1 − tx) ≤ 1/t.
Proof: Since ln(1 − tx) < 0, it follows that the claim is equivalent to t ln(1 − x) ≥ ln(1 − tx). This in turn is equivalent to showing that g(x) = (1 − tx) − (1 − x)^t ≤ 0 in the given range.
This is trivially true for x = 0. Furthermore, taking the derivative, we see g′(x) = −t + t(1 − x)^{t−1}, which is non-positive for x ∈ [0, 1) and t ≥ 1. Therefore, g is non-increasing in the region in which we are interested, and so g(x) ≤ 0 for all values in this interval.
Lemma 15.1.7 There exists a NNbr≈(r, (1 + ε)r) which uses O(dn + n^{1+1/(1+ε)}) space and O(n^{1/(1+ε)}) hash probes for each query. The probability of success (i.e., if there is a point u ∈ P such that d_H(u, q) ≤ r, then we return a point v ∈ P such that d_H(v, q) ≤ (1 + ε)r) is a constant.
Proof: By Lemma 15.1.4, we have an (r, (1 + ε)r, α, β)-sensitive family of hash functions, where α = 1 − r/d and β = 1 − r(1 + ε)/d. As such,
ρ = ln(1/α)/ln(1/β) = (ln α)/(ln β) = ln((d − r)/d) / ln((d − (1 + ε)r)/d) = ln(1 − r/d) / ln(1 − (1 + ε)r/d) ≤ 1/(1 + ε),
by Claim 15.1.6.
By building O(log n) structures of Lemma 15.1.7, we can amplify the probability of success and get a correct result with high probability.
Theorem 15.1.8 Given a set P of n points on the hypercube H^d, and parameters ε > 0 and r > 0, one can build a NNbr≈ = NNbr≈(P, r, (1 + ε)r), such that given a query point q:
• If b(q, r) ∩ P ≠ ∅, then NNbr≈ returns a point u ∈ P such that d_H(u, q) ≤ (1 + ε)r.
• If b(q, (1 + ε)r) ∩ P = ∅, then NNbr≈ returns that no point is within distance ≤ r from q.
In any other case, either answer is correct. The query time is O(dn^{1/(1+ε)} log n) and the space used is O(dn + n^{1+1/(1+ε)} log n).
The result returned is correct with high probability.
Proof: Note, that every point can be stored only once. Any other reference to it in the data-structure can be implemented with a
pointer. Thus, the O(dn) requirement on the space. The other term follows by repeating the space requirement of Lemma 15.1.7
O(log n) times.
In the hypercube case, when d = n^{O(1)}, we can just build M = O(ε^{−1} log n) such data-structures, so that a (1 + ε)-ANN query can be answered by a binary search over these data-structures, which correspond to the radii r_1, ..., r_M, where r_i = (1 + ε)^i.
Theorem 15.1.9 Given a set P of n points on the hypercube H^d (where d = n^{O(1)}), and parameters ε > 0 and r > 0, one can build an ANN data-structure using O((d + n^{1/(1+ε)}) ε^{−1} n log² n) space, such that given a query point q, one can return an ANN to q in P (under the Hamming distance) in O(dn^{1/(1+ε)} log(ε^{−1} log n)) time. The result returned is correct with high probability.
Lemma 15.2.1 Let v ∈ IR^d be a vector, let X ∼ N^d(0, 1), and let Z ∼ N(0, 1). Then v · X has the same distribution as ‖v‖ Z.
Proof: If ‖v‖ = 1 then this holds by the symmetry of the normal distribution. Indeed, let e_1 = (1, 0, ..., 0). By the symmetry of the d-dimensional normal distribution, we have that v · X ∼ e_1 · X = X_1 ∼ N(0, 1). Otherwise, (v/‖v‖) · X ∼ N(0, 1), and as such v · X ∼ N(0, ‖v‖²), which is indeed the distribution of ‖v‖ Z.
A d-dimensional distribution that has the property of Lemma 15.2.1 is called a 2-stable distribution.
If p and q are at distance η from each other, and when we project onto v⃗ the distance between the projections is t, then the probability that they get the same hash value is 1 − t/r, since this is the probability that the random sliding will not separate them. As such, the probability of collision is
α(η) = Pr[h(p) = h(q)] = ∫_{t=0}^{r} Pr[|p · v⃗ − q · v⃗| = t] (1 − t/r) dt.
However, since v⃗ is chosen from a 2-stable distribution, we have that p · v⃗ − q · v⃗ = (p − q) · v⃗ ∼ N(0, ‖pq‖²). Since we are considering the absolute value of the variable, we need to multiply the density by two. Thus, we have
α(η, r) = ∫_{t=0}^{r} (2/(√(2π) η)) exp(−t²/(2η²)) (1 − t/r) dt.
Intuitively, we care about the difference α(1 + ε, r) − α(1, r), and we would like to maximize it as much as possible (by choosing the right value of r). Unfortunately, this integral is unfriendly, and we have to resort to numerical computation.
In fact, if we are going to use this hashing scheme to construct a locality sensitive hashing scheme, as in Lemma 15.1.5, then we care about the ratio
ρ(1 + ε) = min_r log(1/α(1, r)) / log(1/α(1 + ε, r)).
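Eq. (15.1) itself is not reproduced in this excerpt. Under the standard reading of the scheme just described (project onto a random Gaussian vector v⃗, add a random sliding offset, and cut the line into buckets of length r), one hash function looks as follows; this is a hedged sketch of that reading, not necessarily the text's exact definition.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_hash(d, r):
    """One 2-stable hash: p -> floor((<p, v> + b) / r), with v ~ N^d(0,1), b uniform in [0, r)."""
    v = rng.standard_normal(d)
    b = rng.uniform(0.0, r)
    return lambda p: int(np.floor((np.dot(p, v) + b) / r))

h = sample_hash(d=16, r=4.0)
p = rng.standard_normal(16)
q = p + 0.05 * rng.standard_normal(16)   # a nearby point collides with high probability
print(h(p) == h(q))
```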
Lemma 15.2.2 ([DNIM04]) One can choose r such that ρ(1 + ε) ≤ 1/(1 + ε).
Lemma 15.2.2 implies that the hash functions defined by Eq. (15.1) are (1, 1 + ε, α′, β′)-sensitive and, furthermore, ρ = log(1/α′)/log(1/β′) ≤ 1/(1 + ε), for some values of α′ and β′. As such, we can use this hashing family to construct a NNbr≈ for a set P of points in IR^d. Following the same argumentation as in Theorem 15.1.8, we have the following.
Theorem 15.2.3 Given a set P of n points in IR^d, and parameters ε > 0 and r > 0, one can build a NNbr≈ = NNbr≈(P, r, (1 + ε)r), such that given a query point q:
• If b(q, r) ∩ P ≠ ∅, then NNbr≈ returns a point u ∈ P such that ‖u − q‖ ≤ (1 + ε)r.
• If b(q, (1 + ε)r) ∩ P = ∅, then NNbr≈ returns that no point is within distance ≤ r from q.
In any other case, either answer is correct. The query time is O(dn^{1/(1+ε)} log n) and the space used is O(dn + n^{1+1/(1+ε)} log n).
The result returned is correct with high probability.
Theorem 15.2.4 Given a set P of n points in IRd , then one can construct data-structures D that answers (1 + ε)-ANN queries, by
performing O(log(n/ε)) NNbr≈ queries. The total number of points stored at NNbr≈ data-structures of D is O(nε−1 log(n/ε)).
Constructing the data-structure of Theorem 15.2.4 requires building a low quality HST. Unfortunately, the previous constructions we saw for HSTs are either exponential in the dimension or take quadratic time. We next present a faster scheme.
Proof: Our construction is based on a recursive decomposition of the point set. In each stage, we split the point set into two subsets, recursively compute an nd-HST for each subset, and merge the two trees into a single tree by creating a new vertex, assigning it an appropriate value, and hanging the two subtrees from this node. To carry this out, we try to separate the set into two subsets that are furthest away from each other.
Let R = R(P) be the minimum axis-parallel box containing P, and let ν = l(P) = Σ_{i=1}^{d} ‖I_i(R)‖, where I_i(R) is the projection of R to the ith dimension.
Clearly, one can find an axis-parallel strip H of width ≥ ν/((n − 1)d), such that there is at least one point of P on each of its sides, and there are no points of P inside H. Indeed, to find this strip, project the point set into the ith dimension, and find the longest interval between two consecutive points. Repeat this process for i = 1, ..., d, and use the longest interval encountered. Clearly, the strip H corresponding to this interval is of width ≥ ν/((n − 1)d). On the other hand, diam(P) ≤ ν.
Now recursively continue the construction of two trees T^+, T^−, for P^+, P^−, respectively, where P^+, P^− is the splitting of P into two sets by H. We hang T^+ and T^− from the root node v, and set ∆_v = ν. We claim that the resulting tree T is an nd-HST. To this end, observe that diam(P) ≤ ∆_v, and for a point p ∈ P^− and a point q ∈ P^+, we have ‖pq‖ ≥ ν/((n − 1)d), which implies the claim.
To construct this efficiently, we use efficient search trees to store the points according to their order in each coordinate. Let D_1, ..., D_d be those trees, where D_i stores the points of P in ascending order according to the ith axis, for i = 1, ..., d. We modify them such that, for every node v ∈ D_i, we know the largest empty interval along the ith axis for the points P_v (i.e., the points stored in the subtree of v in D_i). Thus, finding the largest strip to split along can be done in O(d log n) time. Now, we need to split the d trees into two families of d trees. Assume we split according to the first axis. We can split D_1 in O(log n) time using the splitting operation provided by the search tree (treaps, for example, can do this split in O(log n) time). Assume that this splits P into two sets L and R, where |L| < |R|.
We still need to split the other d − 1 search trees. This is done by deleting all the points of L from those trees, and building d − 1 new search trees for L. This takes O(|L| d log n) time. We charge this work to the points of L.
Since in every split only the points in the smaller portion of the split get charged, it follows that every point can be charged at most O(log n) times during this construction algorithm. Thus, the overall construction time is O(dn log² n).
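Here is a compact Python sketch of the recursive split just described. It is our own simplification: the d balanced search trees are replaced by rescanning the coordinates, so it runs in roughly O(dn² log n) rather than O(dn log² n), but it produces the same kind of tree (split along the widest empty axis-parallel strip, with ∆ set to the sum of the bounding-box side lengths). It assumes the input points are distinct.

```python
import numpy as np

def build_hst(points):
    """Recursive fair split along the widest empty axis-parallel strip."""
    pts = np.asarray(points, dtype=float)
    if len(pts) == 1:
        return {"point": pts[0], "delta": 0.0}
    # nu = sum of the side lengths of the bounding box (an upper bound on diam(P)).
    nu = float(np.sum(pts.max(axis=0) - pts.min(axis=0)))
    best = None                                  # (gap, axis, threshold)
    for axis in range(pts.shape[1]):
        vals = np.sort(pts[:, axis])
        gaps = np.diff(vals)
        j = int(np.argmax(gaps))
        if best is None or gaps[j] > best[0]:
            best = (float(gaps[j]), axis, (vals[j] + vals[j + 1]) / 2.0)
    _, axis, thr = best
    left = pts[pts[:, axis] <= thr]
    right = pts[pts[:, axis] > thr]
    return {"delta": nu, "left": build_hst(left), "right": build_hst(right)}

tree = build_hst(np.random.default_rng(4).random((10, 3)))
print(tree["delta"])
```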
Theorem 15.2.6 Given a set P of n points in IR^d, and parameters ε > 0 and r > 0, one can build an ANN data-structure using
O(dn + n^{1+1/(1+ε)} ε^{−2} log³(n/ε))
space, such that given a query point q, one can return a (1 + ε)-ANN to q in P in
O(dn^{1/(1+ε)} log n log(n/ε))
time.
Proof: We compute the low quality HST using Lemma 15.2.5. This takes O(nd log2 n) time. Using this HST, we can construct
the data-structure D of Theorem 15.2.4, where we do not compute the NNbr≈ data-structures. We next traverse the tree D, and
construct the NNbr≈ data-structures using Theorem 15.2.3.
We only need to prove the bound on the space. Observe that we need to store each point only once, since other places can refer to it by a pointer; this gives the O(nd) term. The other term comes from plugging the bound of Theorem 15.2.4 into the bound of Theorem 15.2.3.
From approximate near-neighbor in IR^d to approximate near-neighbor on the hypercube. The reduction is quite involved, and we only sketch the details. Let P be a set of n points in IR^d. We first reduce the dimension to k = O(ε^{−2} log n) using the Johnson-Lindenstrauss lemma. Next, we embed this space into ℓ_1^{k′} (this is the space IR^{k′}, where distances are measured by the L_1 metric instead of the regular L_2 metric), where k′ = O(k/ε²). This can be done with distortion (1 + ε).
Let Q be the resulting set of points in IR^{k′}. We want to solve NNbr≈ on this set of points, for radius r. As a first step, we partition the space into cells by taking a grid with sidelength (say) k′r, translating it randomly, and clipping the points into the grid cells. It is now sufficient to solve the NNbr≈ problem inside each grid cell (which has bounded diameter as a function of r), since only with small probability does the random shift separate a query point from its near neighbor; we amplify the probability of success by repeating this a polylogarithmic number of times.
0
Thus, we can assume that P is contained inside a cube of side length ≤ k0 nr, and it is in IRk , and the distance
metric is the L1 metric. We next, snap the points of P to a grid of sidelength (say) εr/k0 . Thus, every point of P now
has an integer coordinate, which is bounded by a polynomial in log n and 1/ε. Next, we write the coordinates of the
points of P using unary notation. (Thus, a point (2, 5) would be written as (010, 101) assuming the number of bits for
each coordinates is 3.) It is now easy to verify that the hamming distance on the resulting strings, is equivalent to the
L1 distance between the points.
Thus, we can solve the near-neighbor problem for points in IRd by solving it on the hypercube under the Hamming
distance.
See Indyk and Motwani [IM98] for more details.
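A tiny illustration (ours) of the final step: unary encoding turns the L_1 distance between small non-negative integer coordinates into the Hamming distance between bit strings.

```python
def unary(value, width):
    """Encode an integer 0 <= value <= width as 'value' ones followed by zeros."""
    return "1" * value + "0" * (width - value)

def encode(point, width):
    return "".join(unary(c, width) for c in point)

a, b = encode((2, 5), 7), encode((4, 1), 7)
hamming = sum(x != y for x, y in zip(a, b))
l1 = abs(2 - 4) + abs(5 - 1)
print(hamming == l1)   # True: Hamming distance of the encodings equals the L1 distance
```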
This relationship indicates that ANN on the hypercube is “equivalent” to ANN in Euclidean space. In particular, making progress on ANN on the hypercube would probably lead to similar progress on the Euclidean ANN problem.
We have only scratched the surface of proximity problems in high dimensions. The interested reader is referred to the survey by Indyk [Ind04] for more information.
Chapter 16
16.2 Ellipsoids
Definition 16.2.1 Let b = { x | ‖x‖ ≤ 1 } be the unit ball in IR^n. Let a ∈ IR^n be a vector and let T : IR^n → IR^n be an invertible linear transformation. The set
E = T(b) + a
is called an ellipsoid, and a is its center.
Clearly, x ∈ E if and only if ‖T^{−1}(x − a)‖ ≤ 1. However,
‖T^{−1}(x − a)‖² = ⟨T^{−1}(x − a), T^{−1}(x − a)⟩ = (T^{−1}(x − a))^T T^{−1}(x − a) = (x − a)^T (T^{−1})^T T^{−1} (x − a).
In particular, let Q = (T^{−1})^T T^{−1}. Observe that Q is symmetric, and that it is positive definite. Thus,
E = { x ∈ IR^n | (x − a)^T Q (x − a) ≤ 1 }.
If we change the basis of IR^n to be the set of unit eigenvectors of Q, then Q becomes a diagonal matrix, and we have that
E = { (y_1, ..., y_n) ∈ IR^n | λ_1(y_1 − a_1)² + ⋯ + λ_n(y_n − a_n)² ≤ 1 },
where a = (a_1, ..., a_n) and λ_1, ..., λ_n are the eigenvalues of Q. In particular, this implies that the points (a_1, ..., a_i ± 1/√λ_i, ..., a_n) ∈ ∂E, for i = 1, ..., n. In particular,
Vol(E) = Vol(b)/(√λ_1 ⋯ √λ_n) = Vol(b)/√(det(Q)).
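A quick numerical check (ours) of the volume formula: for E = T(b) + a with Q = (T^{−1})^T T^{−1}, we have Vol(E) = Vol(b)/√(det(Q)) = |det(T)| · Vol(b).

```python
import math
import numpy as np

def unit_ball_volume(n):
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

rng = np.random.default_rng(5)
n = 4
T = rng.standard_normal((n, n))              # an invertible linear map (almost surely)
Tinv = np.linalg.inv(T)
Q = Tinv.T @ Tinv

vol_via_Q = unit_ball_volume(n) / math.sqrt(np.linalg.det(Q))
vol_via_T = abs(np.linalg.det(T)) * unit_ball_volume(n)
print(np.isclose(vol_via_Q, vol_via_T))      # the two expressions agree
```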
For a convex body K (i.e., a convex and bounded set), let E be a largest volume ellipsoid contained inside K. One can show
that E is unique. Namely, there is a single maximum volume ellipsoid inside K.
Theorem 16.2.2 Let K ⊂ IR^n be a convex body, and let E ⊆ K be its maximum volume ellipsoid. Suppose that E is centered at the origin; then K ⊆ nE = { nx | x ∈ E }.
Proof: By applying a linear transformation, we can assume that E is the unit ball b. And assume for the sake of contradiction
that there is a point p ∈ K, such that kpk > n. Consider the set C which is the convex-hull of {p} ∪ b. Since K is convex, we have
that C ⊆ K.
We will reach a contradiction, by finding an ellipsoid G which has volume larger than b and is enclosed inside C.
By rotating space, we can assume that the apex p of C is the point (ρ, 0, ..., 0), for ρ > n. We consider ellipsoids of the form
G = { (y_1, ..., y_n) | (y_1 − τ)²/α² + (1/β²) Σ_{i=2}^{n} y_i² ≤ 1 }.
We need to pick the values of τ, α and β such that G ⊆ C. Observe that, by symmetry, it is enough to enforce that G ⊆ C in the first two dimensions.
Thus, we can consider C and G to be in two dimensions. Now, the center of G is going to be on the x-axis, at the point (τ, 0). The set G is just an ellipse with axes parallel to the x- and y-axes. In particular, we require that (−1, 0) be on the boundary of G. This implies that (−1 − τ)² = α² and that 0 ≤ τ ≤ (ρ − 1)/2. Namely,
α = 1 + τ.     (16.1)
In particular, the equation of the curve forming the boundary of G is
F(x, y) = (x − τ)²/(1 + τ)² + y²/β² − 1 = 0.
We next compute the value of β². Consider the tangent ℓ to the unit circle that passes through p = (ρ, 0), and let q be the point where ℓ touches the unit circle. We have △poq ∼ △oqr, and as such
‖r − q‖/‖q − o‖ = ‖q − o‖/‖o − p‖.
Since ‖q − o‖ = 1, we have ‖q − r‖ = 1/ρ. Furthermore, since q is on the unit circle, we have
q = (1/ρ, √(1 − 1/ρ²)).
As such, the equation of the line ℓ is
⟨q, (x, y)⟩ = 1  ⟹  x/ρ + √(1 − 1/ρ²) y = 1  ⟹  ℓ ≡ y = −x/√(ρ² − 1) + ρ/√(ρ² − 1).
Next, consider the tangent ℓ′ to G at (u, v). We will derive a formula for ℓ′ as a function of (u, v), and then require that ℓ = ℓ′. The slope of ℓ′ is the slope of the tangent to G at (u, v), which is
dy/dx = −F_u(u, v)/F_v(u, v) = −(2(u − τ)/(1 + τ)²) / (2v/β²) = −β²(u − τ)/(v(1 + τ)²),
by computing the derivatives of the implicit function F. The line ℓ′ is just
((u − τ)/α²)(x − τ) + (v/β²) y = 1,
since it has the required slope and it passes through (u, v). Namely, ℓ′ ≡ y = −(β²(u − τ)/(v α²))(x − τ) + β²/v. Setting ℓ′ = ℓ, we have that the line ℓ′ passes through (ρ, 0). As such,
(ρ − τ)(u − τ)/α² = 1  ⟹  (u − τ)/α = α/(ρ − τ)  ⟹  (u − τ)²/α² = α²/(ρ − τ)².     (16.2)
Since ℓ and ℓ′ are the same line, we have that
β²(u − τ)/(v α²) = 1/√(ρ² − 1).
However,
β²(u − τ)/(v α²) = (β²/(vα)) · ((u − τ)/α) = (β²/(vα)) · (α/(ρ − τ)) = β²/(v(ρ − τ)).
Thus,
β²/v = (ρ − τ)/√(ρ² − 1).
Squaring and inverting both sides, we have v²/β⁴ = (ρ² − 1)/(ρ − τ)², and thus v²/β² = β²(ρ² − 1)/(ρ − τ)². The point (u, v) ∈ ∂G, and as such (u − τ)²/α² + v²/β² = 1. Using Eq. (16.2) and the above, we get
α²/(ρ − τ)² + β²(ρ² − 1)/(ρ − τ)² = 1,
and thus
β² = ((ρ − τ)² − α²)/(ρ² − 1) = ((ρ − τ)² − (τ + 1)²)/(ρ² − 1) = (ρ − 2τ − 1)(ρ + 1)/(ρ² − 1) = (ρ − 2τ − 1)/(ρ − 1) = 1 − 2τ/(ρ − 1).
Namely, for n ≥ 2, and 0 ≤ τ < (ρ − 1)/2, we have that the ellipsoid G defined by the parameters α = 1 + τ and β is contained inside the “cone” C. It holds that Vol(G) = β^{n−1} α Vol(b), and thus
µ = ln(Vol(G)/Vol(b)) = (n − 1) ln β + ln α = ((n − 1)/2) ln β² + ln α.
For τ > 0 sufficiently small, we have ln α = ln(1 + τ) = τ + O(τ²), because of the Taylor expansion ln(1 + x) = x − x²/2 + x³/3 − ⋯, for −1 < x ≤ 1. Similarly, ln β² = ln(1 − 2τ/(ρ − 1)) = −2τ/(ρ − 1) + O(τ²). Thus,
µ = ((n − 1)/2)(−2τ/(ρ − 1) + O(τ²)) + τ + O(τ²) = (1 − (n − 1)/(ρ − 1)) τ + O(τ²) > 0,
for τ sufficiently small and ρ > n. Thus, Vol(G)/Vol(b) = exp(µ) > 1, implying that Vol(G) > Vol(b). A contradiction.
A convex body K centered at the origin is symmetric if p ∈ K implies −p ∈ K. Interestingly, the constant in Theorem 16.2.2 can be improved to √n in this case. We omit the proof, since it is similar to the proof of Theorem 16.2.2.
Theorem 16.2.3 Let K ⊂ IR^n be a symmetric convex body, and let E ⊆ K be its maximum volume ellipsoid. Suppose that E is centered at the origin; then K ⊆ √n E.
Chapter 17
In addition, the sirloin which I threw overboard, instead of drifting off into the void, didn’t seem to want to leave
the rocket and revolved about it, a second artificial satellite, which produced a brief eclipse of the sun every eleven
minutes and four seconds. To calm my nerves I calculated till evening the components of its trajectory, as well as
the orbital perturbation caused by the presence of the lost wrench. I figured out that for the next six million years the
sirloin, rotating about the ship in circular path, would lead the wrench, then catch up with it from behind and pass it
again.
– The Star Diaries, Stanislaw Lem.
In this chapter, we will introduce a powerful technique for “structure” approximation. The basic idea is to perform a search by assigning the elements weights and picking elements according to their weights. An element’s weight indicates its importance. By repeatedly picking elements according to their weights, and updating the weights of objects that are being neglected (i.e., they are more important than their current weights indicate), we end up with a structure that has the desired properties.
We will demonstrate this technique for two problems. In the first problem, we will compute a spanning tree of points that
has low stabbing number. In the second problem, we will show how the Set Cover problem can be approximated efficiently in
geometric settings, yielding a better bound than the general approximation algorithm for this problem.
segments of E_{i−1} that intersect ℓ. Thus, the weight of a segment s, in the beginning of the ith iteration, is
ω_{s,i} = Σ_{ℓ∈L, ℓ∩s≠∅} ω_{ℓ,i−1}.
Clearly, the heavier a segment s is, the less desirable it is for the spanning tree. As such, we always pick an edge qr such that q, r ∈ P belong to two different connected components of the forest induced by E_{i−1}, and its weight is minimal among all such edges. We repeat this process till we end up with a spanning tree of P. To simplify the implementation of the algorithm, when adding s to the set of edges in the forest, we also remove one of its endpoints from P (i.e., every connected component of our forest has a single representative point). Thus, the algorithm terminates when P contains a single point.
We claim that the resulting spanning tree has the required properties. Clearly, the algorithm can be implemented in polynomial time, since it performs n − 1 iterations, and as such the largest weight used is ≤ 2^n. Such numbers can be manipulated in polynomial time, and as such the running time is polynomial.
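A brute-force Python sketch of the reweighting algorithm just described (ours): the set of lines L is taken as input, the minimum-crossing-weight edge between different components is found by scanning all pairs, and the weights of the crossed lines are doubled after each pick.

```python
import itertools
import numpy as np

def crosses(line, p, q):
    """Does the line a*x + b*y + c = 0 separate points p and q?"""
    a, b, c = line
    return (a * p[0] + b * p[1] + c) * (a * q[0] + b * q[1] + c) < 0

def low_stabbing_tree(points, lines):
    points = [tuple(p) for p in points]
    weight = {tuple(l): 1.0 for l in lines}
    parent = {p: p for p in points}               # union-find over the points

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    edges = []
    while len(edges) < len(points) - 1:
        best = None
        for p, q in itertools.combinations(points, 2):
            if find(p) == find(q):
                continue
            w = sum(weight[l] for l in weight if crosses(l, p, q))
            if best is None or w < best[0]:
                best = (w, p, q)
        _, p, q = best
        edges.append((p, q))
        for l in weight:                           # double the weight of every crossed line
            if crosses(l, p, q):
                weight[l] *= 2.0
        parent[find(p)] = find(q)
    return edges

rng = np.random.default_rng(6)
pts = rng.random((12, 2))
lns = [(np.cos(t), np.sin(t), -rng.random()) for t in rng.random(20) * np.pi]
print(len(low_stabbing_tree(pts, lns)))
```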
Lemma 17.1.1 Let P be a set of n points in the plane, and let L be a set of lines in the plane with total weight W. One can always find a pair of points q and r in P, such that the total weight of the segment s = qr (i.e., the total weight of the lines of L intersecting s) is at most cW/√n, for some constant c.
Proof: First, since the weights considered by us are always integers, we can assume all the weights are 1, by replacing a line ℓ of weight ω_ℓ by ω_ℓ copies of it. Perturb the lines slightly, so that no pair of them is parallel. Next, consider a point q ∈ IR², and all the vertices of the arrangement A = A(L). Consider the ball b(q, r) of all vertices of the arrangement A at crossing distance at most r from q.
We claim that |b(q, r)| ≥ r²/8. Indeed, one can shoot a ray ζ from q that intersects at least W/2 lines of L. Let ℓ_1, ..., ℓ_{r/2} be the first r/2 lines hit by the ray ζ, and let r_1, ..., r_{r/2} be the respective intersection points between these lines and ζ. Now, mark all the intersection points of the arrangement A(L) along the line ℓ_i that are at distance at most r/2 from r_i, for i = 1, ..., r/2. Clearly, we marked at least (r/2)(r/2)/2 vertices of the arrangement overall, since we marked (at least) r/2 vertices along each of the lines ℓ_1, ..., ℓ_{r/2}, and each vertex can be counted in this way at most twice. Now, observe that all these vertices are at distance at most (r/2) + (r/2) from q, by the triangle inequality.
So, consider the set of vertices X(r) = ∪_{p∈P} b(p, r). Clearly, as long as the balls of X(r) are disjoint, the number of vertices of the arrangement A included in X(r) is at least nr²/8. In particular, the overall number of vertices in the arrangement is W(W − 1)/2, and as such two balls of X(r) must intersect once nr²/8 > W(W − 1)/2; namely, when r² > 4W²/n, that is, when r > 2W/√n. Now, when r = ⌈2W/√n⌉ + 1, there must be a vertex v of the arrangement A and two points q, r ∈ P, such that d_Q(q, v), d_Q(v, r) ≤ r, and by the triangle inequality, we have that
d_Q(q, r) ≤ d_Q(q, v) + d_Q(v, r) ≤ O(W/√n).
W_i ≤ W_{i−1} + cW_{i−1}/√(n_i) ≤ (1 + c/√(n_i)) W_{i−1} ≤ W_0 ∏_{k=1}^{i} (1 + c/√(n_k))
   ≤ W_0 exp( Σ_{k=1}^{i} c/√(n_k) ) = W_0 exp( Σ_{k=1}^{i} c/√(n − k + 1) ),
since 1 + x ≤ e^x, for all x ≥ 0. In particular, we have that
W_n ≤ W_0 exp( Σ_{k=1}^{n} c/√(n − k + 1) ) = W_0 exp( Σ_{k=1}^{n} c/√k ) ≤ n² exp(4c√n),
since Σ_{k=1}^{n} 1/√k ≤ 1 + ∫_{x=1}^{n+1} (1/√x) dx = 1 + [2√x]_{x=1}^{x=n+1} ≤ 4√n. In the other direction, consider the heaviest line ℓ in L at the end of the execution of the algorithm. If it crosses ∆ edges of the spanning tree, then its weight is 2^∆, and as such
2^∆ = ω_ℓ ≤ W_n ≤ n² exp(4c√n).
It follows that ∆ = O(log n + √n), as required. Namely, any line in the plane crosses at most O(√n) edges of T.
Theorem 17.1.2 Given a set P of n points in the plane, one can compute a spanning tree T of P such that each line crosses at most O(√n) edges of T. The running time is polynomial in n.
This result also holds in higher dimensions. The proof is left as an exercise (see Exercise 17.4.1).
Theorem 17.1.3 Given a set P of n points in IRd , one can compute a spanning tree T of P such that each line crosses at most
O(n1−1/d ) edges of T. The running time is polynomial in n.
Lemma 17.1.4 One can compute a perfect matching M of a set of n points in the plane, such that every line crosses at most O(√n) edges of the matching.
(A somewhat similar argument is being used in the 2-approximation algorithm for TSP with the triangle inequality.)
Now, going back to the discrepancy question, we remind the reader that we would like to color the points by {−1, 1} such that for any halfplane the ‘balance’ of the coloring is as close to perfect as possible. To this end, we use the matching of the above lemma, and plug it into Theorem 5.4.1. Since any line ℓ crosses at most #_ℓ = O(√n) edges of M, we get the following result.
Theorem 17.1.5 Let P be a set of n points in the plane. One can compute a coloring χ of P by {−1, 1}, such that for every halfplane h it holds that
|χ(h)| = O(n^{1/4} √(ln n)).
Namely, the discrepancy of n points with respect to halfplanes is O(n^{1/4} √(ln n)). This also implies that one can construct a better ε-sample in this case. But before delving into this, let us prove a more general version of the spanning tree lemma.
Lemma 17.1.6 Let P be a set of n points, and let F be a weighted set of ranges from R|_P, with total weight W. Then, there is a pair of points e = {p, q} ⊆ P, such that the total weight of the ranges crossed by e is at most O(Wn^{−1/d} log n).
Proof: Let ε be a parameter to be specified shortly. For an edge {u, v} ⊆ P, consider the set of ranges that it crosses:
C(u, v) = { r ∈ F | (u ∈ r and v ∉ r) or (u ∉ r and v ∈ r) }.
So, consider the dual range space S⋆ = (R, X⋆).´ This range space has shattering dimension bounded by d, by assumption. Consider the new range space
T = (R, { r ⊕ r′ | r, r′ ∈ X⋆ }),
where ⊕ is the symmetric difference of the two sets; namely, r ⊕ r′ = (r \ r′) ∪ (r′ \ r). By arguing as in Corollary 5.2.7, we have that T has shattering dimension at most 2d. Furthermore, the projected range space T|_F has C(u, v) as a range.
So, consider a random sample R of size O((d/ε) log(d/ε)) from F (note that F is weighted, and the random sampling is done accordingly). By the ε-net theorem (Theorem 5.3.4), we know that, with constant probability, this is an ε-net. Namely, a set C(u, v) which does not contain any range of R has weight at most εW.
On the other hand, the range space S⋆|_R = (R, X⋆|_R) has at most
µ = O(|R|^d) = (c (d/ε) log(d/ε))^d
ranges, since the dual range space has shattering dimension d, where c is an appropriate constant. In particular, let us pick ε as large as possible, such that µ < n = |P|. We are then guaranteed that there are two points p, q ∈ P such that R_p = R_q. In particular, the total weight of C(p, q) is at most εW. Thus, we are left with the task of picking ε.
So, for ε = c_1 n^{−1/d} log n it holds that µ < n, if c_1 is sufficiently large, as can be verified.
To complete the argument, we need to bound the total weight of the ranges at the end of this process. It is bounded by
U = n^{d′} ∏_{i=1}^{n} (1 + c_1 log(i)/i^{1/d}) ≤ n^{d′} exp( c_1 Σ_i log(i)/i^{1/d} ) ≤ n^{d′} exp( O(c_1 n^{1−1/d} log n) ).
Now, the crossing number of the resulting tree T, for any range r ∈ R, is bounded by lg(U) = O(d′ log n + n^{1−1/d} log n). We thus conclude:
Theorem 17.1.7 Given a range space S = (X, R) with shattering dimension d′ and dual shattering dimension d, and a set P ⊆ X of n points, one can compute, in polynomial time, a spanning tree T of P, such that any range of R crosses at most O(d′ log n + n^{1−1/d} log n) edges of T.
The algorithm. Interestingly, one can do much better if the set system S has bounded dual shattering dimension d. Indeed, let us assign weight 1 to each range of R, and pick a random subset R of size O((d/ε) log(d/ε)) from R, where ε = 1/(4k) (the sample is chosen according to the weights). If the sets of R cover all the points of X, then we are done. Otherwise, consider a point p ∈ X which is not covered by R. If the total weight of R_p (i.e., the set of ranges covering p) is smaller than εW(R), where W(R) = Σ_{r∈R} ω_r, then we double the weight of all the sets of R_p. In any case, even if the doubling is not carried out, we repeat this process till it succeeds.
Details and Intuition. In the above algorithm, if a random sample fails (i.e., there is an uncovered point p) then one of the ranges that cover p must be in the optimal solution. In particular, by increasing the weight of the ranges covering p, we improve the probability that p would be covered in the next iteration. Furthermore, with good probability, the sample is an ε-net, and as such the algorithm doubles the weight of only a “few” ranges. One of these few ranges must be in the optimal solution. As such, the weight of the optimal solution grows exponentially at a faster rate than the total weight of the universe, implying that at some point the algorithm must terminate, as otherwise the weight of the optimal solution would exceed the total weight of all the ranges, which is of course impossible.
´
We remind the reader that X? = Rp p ∈ X and Rp = r r ∈ R, the range r contains p .
If the ranges of R are geometric shapes, it means that the arrangement formed by the shapes of R has at most µ faces.
156
17.2.1 Proof of correctness
In the following, we bound the number of iterations performed by the algorithm. As before, let W0 = m be the initial weight of the
ranges, and Wi would be the weight in the end of the ith iteration. We consider an iteration to be successful if the doubling stage is
being performed in the iteration. Since an iteration is successful if the sample is an ε-net, and by Theorem 5.3.4 the probability for
that is at least, say, 1/2, it follows that we need to bound only the number of successful iterations (indeed, it would be easy to verify
that with high probability, using the Chernoff inequality, the number of successful iteration is at least a quarter of all iterations
performed).
As before, we know that Wi ≤ (1 + ε)Wi−1 = (1 + ε)i m ≤ m exp(εi). On the other hand, in each iteration the algorithm “hits” at
least one of the ranges in the optimal solution. Let ti ( j) be the number of times the weight of the jth range in the optimal solution
was doubled, for j = 1, . . . , k, where k is the size of the optimal solution. Clearly, the weight of the universe in the ith iteration is at
least
Xk
2ti ( j) .
j=1
But this quantity is minimized when ti (1), . . . , eti (k) are as equal to each other as possible. (Indeed, 2a + 2b ≥ 2 · 2b(a+b)/2c , for any
integers a, b ≥ 0.) As such, we have that
Xk
k2bi/kc ≤ 2ti ( j) ≤ Wi ≤ m exp(εi) .
j=1
t
So, consider i = tk, for t an integer. We have that k2 ≤ m exp(εtk) = m exp(t/4), since ε = 1/4k. Namely,
t t
lg k + t ≤ lg m + lg e ≤ lg m + ,
4 2
as lg e ≤ 1.45. Namely, t ≤ 2 lg(m/k). We conclude that the algorithm performs at most 2 lg(m/k) successful iterations, and as such
it performs at most O(log(m/k)) iterations overall.
Running time. It is easy to verify that with careful implementation the sampling stage can be carried out in linear time. The
size of the resulting approximation is O((d/ε) log(d/ε)) = O(dk log dk). Checking if all the points are covered by the random
sample takes O(ndk log dk) time, assuming we can in constant time determine if a point is inside a range. Computing the total
weight of the ranges covering p takes O(m) time. Thus, each iteration takes O(m + ndk log dk) time.
Note, that we assumed that k is known to us in advance. This can be easily overcome by doing an exponential search for the
right value of k. Given a guess κ to the value of k, we will run the algorithm with κ. If the algorithm exceeds c log(m/k) iterations
without terminating, we know the guess is too small and we continue to try 2κ. We thus get the following.
Theorem 17.2.1 Given a finite range space S = (X, R) with n points and m ranges, and S has a dual shattering dimension d, then
one can compute a cover of X, using the ranges of R, such that the cover uses O(dk log(dk)) sets, where k is the size of the optimal
set cover.
The running time of the algorithm is O((m + ndk log(dk)) log(m/k) log n) time, with high probability, assuming that in constant
time we can decide if a point is inside a range.
Theorem 17.2.2 The VC dimension of the range space formed by all visibility polygons inside a polygon P (i.e., S above) is a
constant.
®
As such, under our definition of visibility, one can see through a reflex corner of a polygon.
157
So, consider a simple polygon P with n vertices, and let b P be a (finite) set of visibility polygons that covers P (say, all the
visibility polygons induced by vertices of P ). The range space induced by P and b P has finite VC dimension (since this range space
is contained inside the range space of Theorem 17.2.2). To make things more concrete, consider placing a point inside each face of
the arrangement of the visibility polygons of bP inside P , and let U denote
the resulting
set of points. Next, consider the projection
b
of this range space into U; namely, consider the range space S = U, Q ∩ U Q ∈ P . Clearly, a set cover of minimal size for S
corresponds to a minimal number of visibility polygons of bP that covers P . Now, S has size polynomial in P and b P and it can be
clearly be computed efficiently. We can now apply the algorithm of Theorem 17.2.1 to it, and get a “small” set cover. We conclude:
Theorem 17.2.3 Given a simple polygon P of size n, and a set of visibility polygons b P (of polynomial size) that covers P , then one
can compute, in polynomial time, a cover of P using O(kopt log kopt ) polygons of b
P, where kopt is the smallest number of polygons of
b
P that completely covers P .
Proof: Let S be the set of k points under consideration, which are on one side of s and Q is on the other side. Each of these
points sees a subinterval of s (inside P ), and this interval induces a wedge in the plane on the other side of s (we ignore for
the time being the boundary of P ). Also, consider the lines passing through pair of points of S . Together, this set of O(k2 )
lines, rays and segments, partition the place into O(k4 ) different faces, so that for a point inside a face of this arrangement, it
sees the same subset of points of S through the segment s in the same radial order. So, consider a point p inside such a face f ,
and let q1 , . . . , qm be the (clockwise ordered) list of points of S that p sees. To visualize this list, connect p to each one of this
points by a segment. Now, as we now introduce the boundaries of P , some of these segments are no longer realizable since
the intersect the boundary of polygon. Observe, however, the set of visible points from p must be a consecutive subsequence
of the above list; namely, p can see inside P the points qL , . . . , qR , for some L ≤ R. We conclude that since there are O(k2 )
choices for the indices L and R, it follows that inside f there are most O(k2 ) different subsets of S that are realizable inside
Q . Now, there are O(k4 ) such faces in the arrangement, which implies the bound.
Now, consider a triangulation of P . There must be a triangle in this triangulation, such that if we remove it, then every one of
the remaining pieces Q1 , Q2 , Q3 contains at most 2k/3 points of U. Let 4 be this triangle.
Clearly, the complexity of the visibility polygons of b P inside 4 is O(k2 ). Furthermore, inside Qi one can see only T out (k, Qi )
different subsets of the points of U outside Qi , for i = 1, 2, 3. Thus, the total number of different subsets of U one can see is bounded
by
X
3
T (k, P ) ≤ O(k2 ) + T out (k, Qi ) · T (|U ∩ Qi | , Qi ) = O(k2 ) + O(k6 )T (2k/3) = kO(log k) ,
i=1
by Lemma 17.2.4. However, for U to be shattered, we need that T (k, P ) ≥ 2k . Namely, we have that
2k ≤ kO(log k) .
158
A natural question is whether one can compute a spanning tree which is weight sensitive. Namely, √ compute a spanning tree of
n points in the plane, such that if a line has only k points on one side of it, it would cross at most O( k) edges of the tree. A slightly
weaker bound is currently known, see Exercise 17.4.3, which is from [HS06]. This leads to small ε-samples which work for ranges
which are heavy enough.
Another natural question is whether given a set of lines and points in the plane, can one compute a spanning tree with overall
small number of crossings. Surprisingly, one can do (1 + ε)-approximation for the best tree in near linear time, see [HI00]. This
result also extends to higher dimensions.
The algorithm we described for Set Cover falls into a general method of multiplicative weights update. Algorithms of this
family include Littleston’s Winnow algorithm [Lit88], Adaboost [FS97], and many more. In fact, the basic idea can be tracked back
to the fifties. See [AHK06] for a nice survey of this method.
17.4 Exercises
Exercise 17.4.1 Prove Theorem 17.1.3.
Exercise 17.4.2 Show that in the worst case,√one can pick a point set P of n points in the plane, such that any spanning tree T of
P, there exists a line `, such that ` crosses Ω( n) edges of T.
Exercise 17.4.3 [20 Points] Let P be a set of n points in the plane. For a line `, let w+` (resp., w−` ) be the number of points of P
lying above (resp., below or on) `, and define the weight of `, denoted by ω`, to be min(w+` , w−` ).
√
Show, that one can construct a spanning tree T for P such that any line ` crosses at most O( ω` log(n/ω`)) edges of T.
159
160
Chapter 18
Isn’t it an artificial, sterilized, didactically pruned world, a mere sham world in which you cravenly vegetate, a world
without vices, without passions without hunger, without sap and salt, a world without family, without mothers,
without children, almost without women? The instinctual life is tamed by meditation. For generations you have left
to others dangerous, daring, and responsible things like economics, law, and politics. Cowardly and well-protected,
fed by others, and having few burdensome duties, you lead your drones’ lives, and so that they won’t be too boring
you busy yourselves with all these erudite specialties, count syllables and letters, make music, and play the Glass
Bead Game, while outside in the filth of the world poor harried people live real lives and do real work.
– The Glass Bead Game, Hermann Hesse
Proof: The upper bound on the width is obvious. As for the lower bound, observe that P encloses a ball of radius 1/2, and as such
its projection in any direction is at least 1.
The diameter is just the distance between the two points (0, . . . , 0) and (1, . . . , 1).
Lemma 18.1.2 Let P be a convex body, and let h be a hyperplane cutting P. Let µ = Vol(h ∩ P), and let ~v be unit vector orthogonal
to h. Let ρ be the length of the projection of P into the direction of ~v. Then Vol(P) ≥ µ · v/d.
Proof: Let P+ = P ∩ h+ and a be the point of maximum distance of P+ from h, where h+ denotes the positive half space induced
by h.
Let C = h ∩ P. By the convexity of P, the pyramid R = CH(C ∪ {a}) is contained inside P, where CH(S ) denotes the
convex-hull of S . We explicitly compute the volume of R for the sake of completeness. To this end, let s be the segment connecting
a to its projection on h, and let α denote the length of s. Parameterizing s by the interval [0, ρ− ], let C(t) denote the intersection
of the hyperplane passing through s(t) and R, where s(0) = a, and s(t) ∈ h. Clearly, Vol(C(t)) = (t/α)d−1 Vol(C) = (t/α)d−1 µ. (We
abuse notations a bit and refer to the (d − 1)-dimensional volume and d-dimensional volume by the same notation.) Thus,
Z α Z α Z α
µ µ αd αµ
Vol(R) = Vol(C(t)) dt = (t/α)d−1 dt = d−1 td−1 dt = d−1 · = .
t=0 t=0 α t=0 α d d
Thus, Vol(P+ ) ≥ αµ/d. Similar argumentation can be applied to P− = P ∩ h− . Thus, we have Vol(P) ≥ ρµ/d.
Proof: Since the width of H = [0, 1]d is 1 and thus the projection of H on the direction perpendicular to h is of length ≥ 1. As
such, by Lemma 18.1.2 if Vol(h ∩ H) > d then Vol(H) > 1. A contradiction.
161
Lemma 18.1.4 Let P ⊆ [0, 1]d be a convex body. Let µ = Vol(P). Then ω(P) ≥ µ/d and P contains a ball of radius µ/(2d2 ).
Proof: By Lemma 18.1.3, any hyperplane cut P in a set of volume at most d. Thus, µ = Vol(P) ≤ ω(P)d. Namely, ω(P) ≥ µ/d.
Next, let E be the largest volume ellipsoid that is contained inside P. By John’s theorem, we have that P ⊆ dE. Let α be the
length of the shortest axis of E. Clearly, ω(P) ≤ 2dα, since ω(dE) = 2dα. Thus 2dα ≥ µ/d. This implies that α ≥ µ/(2d2 ).
Thus, E is an ellipsoid with its shortest axis is of length α ≥ µ/(2d2 ). In particular, E contains a ball of radius α, which is in
turn contained inside P.
In Lemma 18.1.4 we used the fact that r(P) ≥ ω(P)/(2d) (which we proved using John’s theorem), where r(P) is the radius
of the largest
√ ball enclosed inside P. Not surprisingly, considerably
√ better bounds are known. √ In particular, it is known that
ω(P)/(2 d) ≤ r(p) for add dimension, and ω(P)/(2(d + 1)/ d + 2) ≤ r(P). Thus, r(p) ≥ ω(P)/(2 d + 1) [GK92]. Plugging this
fact into the proof of Lemma 18.1.4, will give us slightly better result.
Proof: Let B be the minimum axis-parallel box containing S , and let s and t be the points in S that √ define√ the longest edge of
B, whose length is denoted by l. By the diameter definition, kstk ≤ diam(S ), and clearly, diam(S ) ≤ d l ≤ d kstk. The points s
and t are easily found in O(nd) time.
Alternatively, pick a point s0 ∈ S , and compute its furthest point t0 ∈ S . Next, let a, b be the two points realizing the diameter.
We have diam(S ) = kabk ≤ kas0 k + ks0 bk ≤ 2 ks0 t;k. Thus, ks0 t0 k is 2-approximation to the diameter of P.
Proof: By using the algorithm of Lemma 18.2.1 we compute in O(n) time two points s, t ∈ P which form a 2-approximation of
the diameter of P. For the simplicity of exposition, we assume that st is on the xd -axis (i.e., the line ` ≡ ∪ x (0, . . . , 0, x)), and there
is one point of S that lies on the hyperplane h ≡ xd = 0, an that xd ≥ 0 for all points of P.
Let Q be the orthogonal projection of P into h, and let I be the shortest interval on ` which contain the projection of P into ` s .
By recursion, we can compute a bounding box B0 of Q in h. Let the bounding box be B = B0 × I. Note, that in the bottom of
the recursion, the point-set is one dimensional, and the minimum interval containing the points can be computed in linear time.
Clearly, P ⊆ B, and thus we only need to bound the quality of approximation. We next show that Vol(B) ≥ Vol(P)/cd , where
C = CH(P), and cd = 2d · d!. We prove this by induction on the dimension. For d = 1 the claim trivially holds. Otherwise, by
induction, that Vol(B0 ) ≥ Vol(C 0 )/cd−1 , where C 0 = CH(Q).
For a point p ∈ C 0 , let ` p be the line parallel to xd -axis passing through p. Let L(p) be the minimum value of xd for the
points of ` p lying inside C, and similarly, let U(p) be the maximum value of xd for the points of ` p lying inside C. That is
` p ∩ C = [L[p), U(p)]. Clearly, since C is convex, the function L(·) is concave, and U(·) is convex. As such, γ(p) = U(p) − L(p) is
a convex function, being the difference between a convex and a concave function. In particular, γ(·) induces the following convex
body
[
U= (x, 0), (x, γ(x)) .
x∈C 0
Clearly, Vol(U) = Vol(C). Furthermore, γ((0, . . . , 0)) ≥ kstk and U is shaped like a “pyramid” its base is on the hyperplane xd = 0
is the set C 0 , and the segment [(0, . . . , 0), (0, . . . , 0, kstk)] is contained inside it. Thus,
by Lemma 18.1.2. Let r = |I| be the length of the projection of S into the line `, we have that r ≤ 2 kstk. Thus,
162
Let T be an affine transformation that maps B to the unit hypercube H = [0, 1]d . Observe that Vol(T (C)) ≥ 1/cd . By
2
Lemma
√ 18.1.4, 2there
√ is ball b of radius r d≥ 1/(c d · 2d ) contained inside T (C). The ball b contains a hypercube of sidelength
2r/ d ≥ 2/(2d dcd ). Thus, for c = 1/ 2 d!d , there exists a vector v0 ∈ IRd , such that cH + v0 ⊆ T (C). Thus, applying T −1 to
5/2
both sides, we have that there exists a vector v ∈ IRd , such that c · B + v = c · T −1 (H) + v ⊆ C.
Lemma 18.4.1 If Bopt is the minimum volume bounding box of P, then it has two adjacent faces which are flush.
This provides us with a natural algorithm to compute the minimum volume bounding box. Indeed, let us check all possible
pair of edges e, e0 ∈ CH(P). For each such pair, compute the minimum volume bounding box that has e and e0 as flush.
Consider the normal ~n of the face of a bounding box that contains e. The normal ~n lie on a great circle on the sphere of
directions, which are all the directions that are orthogonal to e. Let us parameterize ~n by a point on this normal. Next, consider
the normal n~0 to the face that is flush to e0 . Clearly, n~0 is orthogonal both to e0 and ~n. As such, we can compute this normal in
constant time. Similarly, we can compute the third direction of the bounding box using vector product in const time. Thus, if e
and e0 are fixed, there is one dimensional family of bounding boxes of P that have e and e0 flush on them, and comply with all the
requirements to be a minimum volume bounding box.
It is now easy to verify that we can compute the representation of this family of bounding boxes, by tracking what vertices of
the convex-hull the bounding boxes touches (i.e., this is similar to the rotating calipers algorithm, but one has to be more careful
about the details). This can be done in linear time, and as such, one con compute the minimum volume bounding box in this family
in linear time. Doing this for all pair of edges, results in O(n3 ) time algorithm, where n = |P|.
Theorem 18.4.2 Let P be a set of n points in IR3 . One can compute the minimum volume bounding box of P in O(n3 ) time.
163
be the set of eight vertices of the cell of G that contains p, and let S G = ∪ p∈S G(p). Define P = CH(S G ). Clearly, CH(P) ⊆ P ⊆ Q.
Moreover, one can compute P in O(n + (1/ε2 ) log (1/ε)) time. On the other hand, P ⊆ B ⊕ Bε . The latter term is a box which
contains at most k = 2c/ε + 1 grid points along each of the directions set by B, so k is also an upper bound for the number of grid
points contained by P in each direction. As such, the convex hull of CH(P) is O(k2 ), as every grid line can contribute at most two
vertices to the convex hull. Let R the set of vertices of P. We next apply the exact algorithm of Theorem 18.4.2 to R. Let b B denote
the resulting bounding box.
It remains to show that b B is a (1 + ε)-approximation of Bopt (P). Let Bεopt be a translation of 4ε Bopt (P) that contains Bε . (The
existence of Bεopt is guaranteed by Lemma 18.3.1, if we take c = 160.) Thus, R ⊆ CH(P) ⊕ Bε ⊆ CH(P) ⊕ Bεopt ⊆ Bopt (P) ⊕ Bεopt .
Since Bopt (P) ⊕ Bεopt is a box, it is a bounding box of P and therefore also of CH(P). Its volume is
ε 3
Vol(Bopt (P) ⊕ Bεopt ) = 1 + Vol(Bopt (P)) < (1 + ε) Vol(Bopt (P)),
4
as desired. (The last inequality is the only place where we use the assumption ε ≤ 1.)
To recap, the algorithm consists of the four following steps:
1. Compute the box B(P) (see Lemma 18.3.1) in O(n) time.
2. Compute the point set S G in O(n) time.
3. Compute P = CH(S G ) in O(n + (1/ε2 ) log (1/ε)) time. This is done by computing the convex hull of all the extreme points
of S G along vertical lines of G. We have O(1/ε2 ) such points, thus computing their convex hull takes O((1/ε2 ) log(1/ε))
time. Let R be the set of vertices of P.
4. Compute Bopt (R) by the algorithm of Theorem 18.4.2. This step requires O((1/ε2 )3 ) = O(1/ε6 ) time.
Theorem 18.5.1 Let P be a set of n points in IR3 , and let 0 < ε ≤ 1 be a parameter. One can compute in O(n + 1/ε6 ) time a
bounding box B(P) with Vol(B(P)) ≤ (1 + ε) Vol(Bopt (P)).
Note that the box B(S ) computed by the above algorithm is most likely not minimal along its directions. The minimum
bounding box of P homothet of B(S ) can be computed in additional O(n) time.
Conjecture 18.6.1 The constants in Lemma 18.3.1 can be improved to be polynomial in the dimension.
Coresets. One alternative approach to the algorithm of Theorem 18.5.1 is to construct G using Bε /2 as before, and picking
from each non-empty cell of G, one point of P as a representative point. This results in a set S of O(1/ε2 ) points. Compute the
minimum volume bounding box S using the exact algorithm. Let B denote the resulting bounding box. It is easy to verify that
(1 + ε)B contains P, and that it is a (1 + ε)-approximation to the optimal bounding box of P. The running time of the new algorithm
is identical. The interesting property is that we are running the exact algorithm on on a subset of the input.
This is a powerful technique for approximation algorithms. You first extract a small subset from the input, and run an exact
algorithm on this input, making sure that the result provides the required approximation. The subset S is referred to as coreset of B
as it preserves a geometric property of P (in our case, the minimum volume bounding box). We will see more about this notion in
the following lectures.
164
Chapter 19
“From the days of John the Baptist until now, the kingdom of heaven suffereth violence, and the violent bear it
away.”
– – Matthew 11:12
Claim 19.1.1 Let P be a set of points in IRd , 0 < ε ≤ 1 a parameter and let Q be a δ-coreset of P for directional width, for
δ = ε/(8d). Let Bopt (Q) denote the minimum volume bounding box of Q. Let B0 be the rescaling of Bopt (Q) around its center by a
factor of (1 + 3δ).
Then, P ⊂ B0 , and in particular, Vol(B0 ) ≤ (1 + ε)Bopt (P), where Bopt (P) denotes the minimum volume bounding box of P.
Proof: Let v be a direction parallel to one of the edges of Bopt (Q), and let ` be a line through the origin with the direction of
v. Let I and I 0 be the projection of Bopt (Q) and B0 , respectively, into `. Let IP be the interval formed by the projection of CH(P)
into `. We have that I ⊆ IP and |I| ≥ (1 − δ) |IP |. The interval I 0 is the result of expanding I around
its center point c by a factor of
1 + 3δ. In particular, the distance between c and the furthest endpoint of I p is ≤ (1 − (1 − δ)/2) I p = (1 + δ) I p /2. Thus, we need
to verify that after the expansion of I it contains this endpoint. Namely,
1−δ 1 + 2δ − 3δ2 1+δ
(1 + 3δ) |IP | ≥ |IP | ≥ |IP | ,
2 2 2
for δ ≤ 1/3. Thus, IP ⊆ I 0 and P ⊆ B0 .
Observe that Vol(B0 ) ≤ (1 + 3δ)d Vol(Bopt (Q)) ≤ exp(3δd) Vol(Bopt (Q)) ≤ (1 + ε)Bopt (Q) ≤ (1 + ε)Bopt (P).
It is easy to verify, that a coreset for directional width, also preserves (approximately) the diameter and width of the point set.
Namely, it captures “well” the geometry of P. Claim 19.1.1 hints on the connection between coreset for directional width and the
minimum volume bounding box. In particular, if we have a good bounding box, we can compute a small coreset for directional
width.
Lemma 19.1.2 Let P be a set of n points in IRd , and let B be a bounding box of P, such that v + cd B ⊆ CH(P), where v is a vector
in IRd , cd = (4d + 1)d and cd B denote the rescaling of B by a factor of cd around its center point.
Then, one can compute
a ε-coreset S for directional width of P. The size of the coreset is O(1/εd−1 ), and construction time is
d−1
O n + min(n, 1/ε ) .
165
Proof: We partition B into a grid, by breaking each edge of B into M = d4/(εcd )e equal length intervals (namely, we tile B with
M d copies of B/M). A cell in this grid is uniquely defined by a d-tuple (i1 , . . . , id ). In particular, for a point p ∈ P, lets I(p) denote
the ID of this point. Clearly, I(p) can be computed in constant time.
Given a (d − 1)-tuple I = (i1 , . . . , id−1 ) its pillar is the set of grid cells that have (i1 , . . . , id−1 ) as the first d − 1 coordinates of
their ID. Scan the points of P, and for each pillar record the highest and lowest point encountered. Here highest/lowest refer to their
value in the dth direction.
We claim that the resulting set S is the required coreset. Indeed, consider a direction v ∈ S(d−1) , and a point p ∈ P. Let q, q0 be
the highest and lowest points in P which are inside the pillar of p, and are thus in S. Let Bq , Bq0 be the two grid cells containing q
and q0 . Clearly, the projection of CH(Bq ∪ Bq0 ) into the direction of v contains the projection of p into the direction of v. Thus, for
a vertex u of Bq it is sufficient to show that
ω(v, {u, q}) ≤ ω(v, B/M) ≤ (ε/2)ω(v, P),
since this implies that ω(v, S) ≥ ω(v, P) − 2ω(v, B/M) ≥ (1 − ε)ω(v, P). Indeed,
ω(v, P) ε
ω(v, B/M) ≤ ω(v, B)/M ≤ ω(v, P)/(cd M) ≤ ≤ ω(v, P).
4/ε 4
As for the preprocessing time, it requires a somewhat careful implementation. We construct a hash-table (of size O(n)) and
store for every pillar the top and bottom points encountered. When handling a point this hash table can be updated in constant time.
Once the coreset was computed, the coreset can be extracted from the hash-table in linear time.
Theorem 19.1.3 Let P be a set of n points in IRd , and let 0 < ε < 1 be a parameter.
One can compute
a ε-coreset S for directional
width of P. The size of the coreset is O(1/εd−1 ), and construction time is O n + min(n, 1/εd−1 ) .
Proof: Compute a good bounding box of P using Lemma 18.3.1. Then apply Lemma 19.1.2 to P.
Proof: Using the algorithm of Lemma 18.3.1 compute, in O(n) time, a bounding B of P, such that there exists a vector w ~ such
~ + d(4d1+1) B ⊆ CH(P) ⊆ B.
that w
Let T 1 be the linear transformation that translates the center of T to the origin. Clearly, S ⊆ P is a ε-coreset for direction width
of P, if and only if S1 = T 1 (S) is an ε-coreset for directional width of P1 = T 1 (P). Next, let T 2 be a rotation that rotates B1 = T 1 (B)
such that its sides are parallel to the axises. Again, S2 = T 2 (S1 ) is a ε-coreset for P2 = T 2 (P1 ) if and only if S1 is a coreset for P1 .
Finally, let T 3 be a scaling of the axises such that B2 = T 2 (B1 ) is mapped to the hypercube H.
Note, that T 3 is just a diagonal matrix. As such, for any p ∈ IRd , and a vector v ∈ S(d−1) we have
T D E
hv, T 3 pi = vT T 3 p = T 3T v p == T 3T v, p = hT 3 v, pi .
= ω(T 3 v, P2 ).
Similarly, ω(v, S3 ) = ω(T 3 v, S2 ).
By definition, S2 is a ε-coreset for P2 iff for any non zero v ∈ IRd , we have ω(v, P2 ) ≥ (1 − ε)ω(v, S2 ). Since T 3 non
singular, this implies that for any non-zero v, we have ω(T 3 v, S2 ) ≥ (1 − ε)ω(T 3 v, P2 ), which holds iff ω(v, S3 ) = ω(v, T 3 (S2 )) ≥
(1 − ε)ω(v, T 3 (P2 )) = (1 − ε)ω(v, P3 ). Thus S3 is a ε-coreset for P3 . Clearly, the other direction holds by a similar argumentation.
Set T = T 3 T 2 T 1 , and observe that, by the above argumentation, S is a ε-coreset for P if and only if T (S) is a ε-coreset for
T (P). However, note that T (B) = H, and T (~ w + d(4d1+1) B) ⊆ CH(T (P)). Namely, there exists a vector w~0 such that w~0 + d(4d1+1) H ⊆
CH(T (P)) ⊆ H. Namely, the point set T (P) is α = d(4d1+1) -fat.
166
p
(L − r)2 + r2 p
x1
{
r
L−r (L, 0)
Proof: For any vector v, we have ω(v, A) ≥ (1 − δ)ω(v, B) ≥ (1 − δ)(1 − ε)ω(v, C) ≥ (1 − δ − ε)ω(v, C).
Thus, given a point-set P, we can first extract from it a ε/2-coreset of size O(1/εd−1 ), using Lemma 19.1.2. Let Q denote the
resulting set. We will compute a ε/2-coreset for Q, which would be by the above observation a ε-coreset for directional width of P.
We need the following technical lemma.
Lemma 19.2.3 Let b be a ball of radius r centered at (L, 0, . . . , 0) ∈ IRd , where L ≥ 2r. Let p be an arbitrary point in b, and let b0
be the largest ball centered at p and touching the origin. Then, we have that for µ(p) = min 0 x1 we have µ(p) ≥ −r2 /L.
(x1 ,x2 ,...,xd )∈b
Proof: Clearly, if we move p in parallel to the x1 -axis by decreasing the value of x1 , we are decreasing the value of µ(p). Thus,
in the worst case x1 (p) = L − r. Similarly, the farther away p is from the x1 -axis the smaller µ(p)
p is. Thus, by symmetry, the worst
case is when p = (L − r, r, 0, . . . , 0). See Figure 19.1. The distance between p and the origin is (L − r)2 + r2 , and
p (L − r)2 − (L − r)2 − r2 r2 r2
µ(p) = (L − r) − (L − r)2 + r2 = p ≥− ≥− ,
(L − r) + (L − r)2 + r2 2(L − r) L
since L ≥ 2r.
Lemma 19.2.4 Let Q be a set of m points. Then one can compute a ε/2-coreset for Q of size O(1/ε(d−1)/2 ), in time O(m/ε(d−1)/2 ).
Proof: Note, that byLemma 19.2.1, we can assume that Q is α-fat for some constant α, and v + α[−1, 1]d ⊆ CH(Q) ⊆ [−1, 1]d ,
where v ∈ IRd . In particular, for any√direction u ∈ S(d−1) , we have ω(u, Q) ≥√2α.
Let S be the sphere of radius d + 1 centered at the origin. Set δ = εα/4 ≤ 1/4. One can construct a set I of O(1/δd−1 ) =
O(1/ε(d−1)/2 ) points on the sphere S so that for any point x on S, there exists a point y ∈ I such that kx − yk ≤ δ. We process Q into
a data structure that can answer ε-approximate nearest-neighbor queries. For a query point q, let φ(q) be the point of Q returned
by this data structure. For each point y ∈ I, we compute φ(y) using this data structure. We return the set S = {φ(y) | y ∈ I}; see
Figure 19.2 (ii).
We now show that S is an (ε/2)-coreset of Q. For simplicity, we prove the claim under the assumption that φ(y) is the exact
nearest-neighbor of y in Q. Fix a direction u ∈ S(d−1) . Let σ ∈ Q be the point that maximizes hu, pi over all p ∈ Q. Suppose the ray
emanating from σ in direction u hits S at a point x. We know that there exists a point y ∈ I such that kx − yk ≤ δ. If φ(y) = σ, then
σ ∈ S and
maxhu, pi − maxhu, qi = 0.
p∈Q q∈S
Now suppose φ(y) , σ. Rotate and translate space, such that σ is at the origin, and u is the positive x1 axis. Setting L = koxk and
r = δ, we have that hu, yi ≥ −r2 /L ≥ −δ2 /1 = −δ2 = −εα/4, by Lemma 19.2.3. We conclude that ω(u, S) ≥ ω(u, Q) − 2(εα/4) =
ω(u, Q) − εα/2. On the other hand, since ω(u, Q) ≥ 2α, it follows tat ω(u, S) ≥ (1 − ε/2)ω(u, Q).
As for the running time, we just perform the scan in the most naive way to find φ(y) for each y ∈ I. Thus, the running time is
as stated.
Theorem 19.2.5 Let P be a set of n points in IRd . One can compute a ε-coreset for directional width of P in O(n + 1/ε3(d−1)/2 ) time.
The coreset size is O(ε(d−1)/2 ).
167
B
(y)
y y
CH(P )
w
z x
u h
(i) (ii)
Figure 19.2: (i) An improved algorithm. (ii) Correctness of the improved algorithm.
Proof: We use the algorithm of Lemma 19.2.1 and Lemma 19.1.2 on the resulting set. This computes a ε/2-coreset Q of P of size
O(1/εd−1 ). Next, we apply Lemma 19.2.4 and compute a ε/2-coreset S of Q. This is a ε-coreset of P.
19.3 Exercises
Exercise 19.3.1 [5 Points]
Prove that in the worst case, a ε-coreset for directional width has to be of size Ω(ε−(d−1)/2 ).
168
Chapter 20
Once I sat on the steps by a gate of David’s Tower, I placed my two heavy baskets at my side. A group of tourists
was standing around their guide and I became their target marker. “You see that man with the baskets? Just right of
his head there’s an arch from the Roman period. Just right of his head.”
“But he’s moving, he’s moving!”
I said to myself: redemption will come only if their guide tells them, “You see that arch from the Roman period? It’s
not important: but next to it, left and down a bit, there sits a man who’s bought fruit and vegetables for his family.”
– –Yehuda Amichai, Tourists
20.1 Preliminaries
Definition 20.1.1 Given a set of hyperplanes H in IRd , the minimization diagram of H, known as the lower envelope of H, is the
function LH : IRd−1 → IR, where we have L(x) = minh∈H h(x), for x ∈ IRd−1 .
Similarly, the upper envelope of H is the function U(x) = maxh∈H h(x), for x ∈ IRd−1 .
The extent of H and x ∈ IRd−1 is the vertical distance between the upper and lower envelope at x; namely, EH (x) = U(x)−L(x).
169
upper envelope
outer
extent
lower envelope
(ii) t
(i) t
Figure 20.1: (i) The extent of the moving points, is no more than the vertical segment connecting the lower
envelope to the upper envelope. The black dots mark where the movement description of I(t) changes. (ii)
The approximate extent.
We want to maintain a vertical interval Iε+ (t) so that I(t) ⊆ Iε+ (t) and Iε+ (t) ≤ (1 + ε) |I(t)| for all t, so that the endpoints of Iε+ (t)
follow piecewise-linear trajectories, and so that the number
of combinatorial changes in I ε (t) is small. Alternatively, we want to
−
maintain a vertical interval Iε (t) ⊆ I(t) such that Iε (t) ≥ (1 − ε) |I(t)|. Clearly, having one approximation would imply the other by
−
appropriate rescaling.
Geometrically, this has the following interpretation: We want to simplify the upper and lower envelopes of A(L) by convex
and concave polygonal chains, respectively, so that the simplified upper (resp. lower) envelope lies above (resp. below) the original
upper (resp. lower) envelope and so that for any t, the vertical segment connecting the simplified envelopes is contained in (1+ε)I(t).
See Figure 20.1 (ii).
In the following, we will use duality, see Lemma 23.2.1 for the required properties we will need.
Definition 20.2.1 For a set of hyperplanes H, a subset S ⊂ H is a ε-coreset of H for the extent measure, if for any x ∈ IRd−1 we
have ES ≥ (1 − ε)EH .
Similarly, for a point-set P ⊆ IRd , a set S ⊆ P is a ε-coreset for vertical extent of P, if, for any direction v ∈ S(d−1) , we have that
µv (S) ≥ (1 − ε)µv (P), where µv (P) is the vertical distance between the two supporting hyperplanes of P which are perpendicular to
v.
Thus, to compute a coreset for a set of hyperplanes, it is by duality and Lemma 23.2.1 enough to find a coreset for the vertical
extent of a point-set.
Lemma 20.2.2 The set S is a ε-coreset of the point set P ⊆ IRd for vertical extent if and only if S is a ε-coreset for directional
width.
Proof: Consider any direction v ∈ S(d−1) , and let α be its (smaller) angle with with the xd axis. Clearly, ω(v, S) = µv (S) cos α
and ω(v, P) = µv (PntS et) cos α. Thus, if ω(v, S) ≥ (1 − ε)ω(v, P) then µv (S) ≥ (1 − ε)µv (P), and vice versa.
Theorem 20.2.3 Let H be a set of n hyperplanes in IRd . One can compute a ε-coreset of H of, size O(1/εd−1 ), in O(n +
min(n, 1/εd−1 )) time. Alternatively, one can compute a ε-coreset of size O(1/ε(d−1)/2 ), in time O(n + 1/ε3(d−1)/2 ).
Proof: By Lemma 20.2.2, the coreset computation is equivalent to computing coreset for directional width. However, this can
be done in the stated bounds, by Theorem 19.1.3 and Theorem 19.2.5.
Going back to our motivation, we have the following result:
Lemma 20.2.4 Let P(t) be a set of n points with linear motion in IRd . We can compute an axis parallel moving bounding box
√
b(t) for P(t) that changes O(d/ ε) times (in other times, the bounding box moves with linear motion). The time to compute this
bounding box is O(d(n + 1/ε3/2 )).
Furthermore, we have that Box(P(t)) ⊆ b(t) ⊆ (1 + ε)Box(P(t)), where Box(t) is the minimum axis parallel bounding box of P.
Proof: We compute the solution for each dimension separately. In each dimension, we compute a coreset of the resulting set
of lines in two dimensions, and compute the upper and lower envelope of the coreset. Finally, we expand the upper and lower
envelopes appropriately so that the include the original upper and lower envelopes. The bounds on the running time follows from
Theorem 20.2.3.
170
20.3 Coresets
At this point, our discussion exposes a very powerful technique for approximate geometric algorithms: (i) extract small subset
that represents that data well (i.e., coreset), and (ii) run some other algorithm on the coreset. To this end, we need a more unified
definition of coresets.
d
Definition 20.3.1 (Coresets) Given a set P of points (or geometric objects) in IRd , and an objective function f : 2IR → IR (say,
f (P) is the width of P), a ε-coreset is a subset S of the points of P such that
f (S) ≥ (1 − ε) f (P).
We will state this fact, by saying that S is a ε-coreset of P for f (·).
If the function f (·) is parameterized, namely f (Q, v), then S ⊆ P is a coreset if
∀v f (S, v) ≥ (1 − ε) f (P, v).
As a concrete example, for v a unit vector, consider the function ω(v, P) which is the directional width of P; namely, it is the
length of the projection of CH(P) into the direction of v.
Coresets are of interest when they can be computed quickly, and have small size, hopefully of size independent of n, the size
of the input set P. Interestingly, our current techniques are almost sufficient to show the existence of coresets for a large family of
problems.
Theorem 20.4.1 Given a family of d-variate polynomials F = { f1 , . . . , fn }, and parameter ε, one can compute, in O(n + 1/ε s ) time,
a subset F 0 ⊆ F of O(1/ε s ) polynomials, such that F 0 is a ε-coreset of F for the extent measure. Here s is the number of different
monomials present in the polynomials of F .
Alternatively, one can compute a ε-coreset, of size O(1/ε s/2 ), in tiem O(n + 1/ε3s/2 ).
171
Proof: Let F 2 denote the family { f1 , . . . , fn }.n Using the algorithm
o of Theorem 20.4.1, we compute a δ0 -coreset G2 ⊆ F 2 of F 2 ,
0 2 1/2 2
where δ = ε /64. Let G ⊆ F denote the family ( fi ) | fi ∈ G .
Consider any point x ∈ IRk . We have that EG2 (x) ≥ (1 − δ0 )EF 2 (x), and let a = LF 2 (x), A = LG2 (x), B = UG2 (x), and
b = UF 2 (x). Clearly, we have 0 ≤ a ≤ A ≤ B ≤ b and B − A ≥ (1 − δ0 )(b − a). Since (1 + 2δ0 )(1 − δ0 ) ≥ 0, we have that
(1 + 2δ0 )(B − A) ≥ b − a. √ √ √ √ √ √
√ By√ Lemma 20.5.2 √ we have that Then, A − a ≤ (ε/2)U , and b − B ≤ (ε/2)U, where U = B − A. Namely,
√ below,
B − A ≥ (1 − ε)( b − a). Namely, G is a ε-coreset for the extent of F .
The bounds on the size of G and the running time are easily verified.
Lemma
√ √ 20.5.2 Let 0 ≤ a √
≤ A ≤√B ≤ b, and 0 < ε ≤ 1 be given
√ parameters,
√ so that b − a ≤ (1 + δ)(B − A), where δ = ε2 /16. Then,
A − a ≤ (ε/2)U , and b − B ≤ (ε/2)U, where U = B − A.
Proof: Clearly,
√ √ √ √ √ √ √ √ √ √ √
A+ B ≤ a + A − a + b ≤ a + δb + b ≤ (1 + δ)( a + b).
√ √
A+√ B √ √
Namely, 1+ δ
≤ a + b. On the other hand,
√ √ b−a (1 + δ)(B − A) √ B−A
b− a = √ √ ≤ √ √ ≤ (1 + δ)(1 + δ) √ √
b+ a b+ a B+ A
√ √ √ √
= ‘(1 + ε2 /16)(1 + ε/4)( B − A) ≤ (1 + ε/2)( B − A).
20.6 Applications
20.6.1 Minimum Width Annulus
Let P = {p1 , . . . , pq
n } be a set of n points in the plane. Let fi (q) denote the distance of the ith point from the point q. It is easy to
2 2
verify that fi (q) = xq − x pi + yq − y pi . Let F = { f1 , . . . , fn }. It is easy to verify that for a center point x ∈ IR2 , the width of the
minimum width annulus containing P which is centered at x has width EF (x). Thus, we would like to compute a ε-coreset for F .
2 2
Consider the set of functions F 2 . Clearly, fi2 (x, y) = x − x pi + y − y pi = x2 − 2x pi x + x2pi + y2 − 2y pi y + y2pi . Clearly,
all the functions of F 2 have this (additive) common
factor of x2 + y2 . Since we only care about the vertical extent, we have
H = −2x p x + x2 − 2y p y + y2 i = 1, . . . , n has the same extent as F 2 ; formally, for any x ∈ IR2 , we have EF 2 (x) = EH (x).
i pi i pi
Now, H is just a family of hyperplanes in IR3 , and it has a ε2 /64-coreset SH for the extent of size 1/ε which can be computed in
O(n + 1/ε3 ) time. This corresponds to a ε2 /64-coreset SF 2 of F 2 . By Theorem 20.5.1, this corresponds to a ε-coreset SF of F .
Finally, this corresponds to coreset S ⊆ P of size O(1/ε), such that the minimum width annulus of S, if we expand it by (1 + 2ε),
it contains all the points of P. Thus, we can just find the minimum width annulus of S. This can be done in O(1/ε2 ) time using an
exact algorithm. Putting everything together, we get:
Theorem 20.6.1 Let P be a set of n points in the plane, and let 0 ≤ ε ≤ 1 be a parameter. One can compute a (1 + ε)-approximate
minimum width annulus to P in O(n + 1/ε3 ) time.
20.7 Exercises
20.8 Bibliographical notes
Linearization was widely used in fields such as machine learning [CS00] and computational geometry [AM94].
There is a general technique for finding the best possible linearization (i.e., a mapping η with the target dimension as small as
possible), see [AM94] for details.
172
Chapter 21
Drug misuse is not a disease, it is a decision, like the decision to step out in front of a moving car. You would call
that not a disease but an error in judgment. When a bunch of people begin to do it, it is a social error, a life-style. In
this particular life-style the motto is “be happy now because tomorrow you are dying,” but the dying begins almost
at once, and the happiness is a memory. ... If there was any “sin,” it was that these people wanted to keep on
having a good time forever, and were punished for that, but, as I say, I feel that, if so, the punishment was far too
great, and I prefer to think of it only in a Greek or morally neutral way, as mere science, as deterministic impartial
cause-and-effect.
– A Scanner Darkly, Philip K. Dick
X
d X
d
fi (u, t) = hpi (t), ui = hai + bi t, ui = ai j u j + bi j · (tu j ).
j=1 j=1
Set F = { f1 , . . . , fn }. Then
ω(u, P(t)) = max hpi (t), ui − min hpi (t), ui = max fi (u, t) − min fi (u, t) = EF (u, t).
i i i i
Since F is a family of (d + 1)-variate polynomials, which admits a linearization of dimension 2d (there are 2d monomials), using
Theorem 20.4.1, we conclude the following.
Theorem 21.1.1 Given a set P of n points in IRd , each moving linearly, and a parameter ε > 0, we can compute an ε-coreset of P
for directional width of size O(1/ε2d ), in O(n + 1/ε2d ) time, or an ε-coreset of size O(1/εd ) in O(n + 1/ε3(d) ) time.
173
If the degree of motion of P is r > 1, we can write the d-variate polynomial fi (u, t) as:
X
r Xr D E
fi (u, t) = hpi (t), ui = ai j t j , u = ai j t j , u
j=0 j=0
where ai j ∈ IRd . A straightforward extension of the above argument shows that fi ’s admit a linearization of dimension (r + 1)d.
Using Theorem 20.4.1, we obtain the following.
Theorem 21.1.2 Given a set P of n moving points in IRd whose motion has degree r > 1 and a parameter ε > 0, we can compute
an ε-coreset for directional width of P of size O(1/ε(r+1)d ) in O(n + 1/ε(r+1)d ) time, or of size O(1/ε(r+1)d /2) in O(n + 1/ε3(r+1)d/2 )
time.
2
Theorem 21.1.3 Given a set P of n points in IRd and a parameter ε > 0, we can compute in O(n + 1/εO(d ) ) time a subset S ⊆ P of
2
size O(1/εO(d ) ) so that for any line ` in IRd , we have w(`, S) ≥ (1 − ε)w(`, P).
Note, that Theorem 21.1.3 does not compute the optimal cylinder, it just computes a small coreset for this problem. Clearly,
4
we can now run any brute force algorithm on this coreset. This would result in running time O(n + 1/εO(d ) ), which would output a
cylinder which if expanded by factor 1 + ε, will cover all the points of P. In fact, the running time can be further improved.
21.2 Exercises
21.3 Bibliographical notes
Section 21.1.1 is from Agarwal et al. [AHV04], and the results can be (very slightly) improved by treating the direction as (d − 1)-
dimensional entity, see [AHV04] for details.
174
Figure 21.1: Parameterization of a line ` in IR3 and its distance from a point p; the small hollow circle on `
is the point closest to p.
Theorem 21.1.3 is also from [AHV04]. The improved running time to compute the approximate cylinder mentioned in the text,
follows by a more involved algorithm, which together with the construction of the coreset, also compute a compact representation
of the extent of the coreset. The technical details are not trivial, and we skip them. In particular, the resulting running time for
2
computing the approximate cylindrical shell is O(n + 1/εO(d ) ). See [AHV04] for more details.
175
176
Chapter 22
“And so ended Svejk’s Budejovice anabasis. It is certain that if Svejk had been granted liberty of movement he
would have got to Budejovice on his own. However much the authorities may boast that it was they who brought
Svejk to his place of duty, this is nothing but a mistake. With Svejk energy and irresistible desire to fight, the
authorities action was like throwing a spanner into the works.”
– – The good soldier Svejk, Jaroslav Hasek
Definition 22.1.1 (Shell sets) Given a set P of points (or geometric objects) in IRd , and F be a family of shapes in IRd . Let
f : F → IR be a target optimization function, and assume that there is a natural expansion operation defined over F. Namely, given
a set r ∈ F, one can compute a set (1 + ε)r which is the expansion of r by a factor of 1 + ε. In particular, we would require that
f ((1 + ε)r) ≤ (1 + ε) f (r).
Let f opt (P) = minr∈F,P⊆r f (r) be the shape in F that bests fits P.
Furthermore, assume that f opt (·) is a monotone function, that is for A ⊆ B ⊆ P we have f opt (A) ≤ f opt (B).
A subset S ⊆ P is a ε-shell set for P, if SlowAlg on a set B that contains S, if the range r returned by SlowAlg(S) covers
S, (1 + ε)r covers P, and f (r) ≤ (1 + ε) f opt (S). Namely, the range (1 + ε)r is an (1 + ε)-approximation to the optimal range of F
covering P.
A shell set S is a monotone ε-shell set if for any subset B containing S, if we apply SlowAlg(B) and get the range r, then P
is contained inside (1 + ε)r and r covers B.
Note, that ε-shell sets are considerably more restricted and weaker than coresets. Of course, a ε-coreset is automatically a
(monotone) ε-shell set. Note also, that if a problem has a monotone shell set, then to approximate it efficiently, all we need to do is
to find some set, hopefully small, that contains the shell set.
177
ComputeShellSet(P)
We initialize all the points of P to have weight 1, and we repeatedly do the following:
• Pick a random sample R from P of size r = O((dimVC /δ) log(dimVC /δ)), where δ =
1/(4kopt ). With constant probability R is a δ-net for P by Theorem 5.3.4.
• Compute, using SlowAlg(R) the range r in F, such that (1 + ε)r covers R and realizes
(maybe approximately) f opt (R).
• Compute the set S of all the points of P outside (1 + ε)r. If the total weight of those points
exceeds δw(P) then the random sample is bad, and return to the first step.
• If the set S is empty then return R as the required shell set, and r as the approximation.
• Otherwise, double the weight of the points of S .
When done, return r and the set R.
Figure 22.1: The algorithm for approximating optimal cover and computing a small shell set.
Finally, assume that we know that a small monotone shell set of size kopt exists for P, but unfortunately we have no way of
computing it explicitly (because, for example, we only have a constructive proof of the existence of such a shell set).
A natural question is how to compute this small shell set quickly, or alternatively compute an approximate shell set which is
not much bigger. Clearly, once we have such a small shell set, we can approximate the optimal cover for P in F.
Example. We start with a toy example, a more interesting example is given below. Let F be the set of all balls in IRd , and let
f (r) be the radius of the ball r ∈ F. It is known that there is a ε-shell set for the minimum radius ball of size O(1/ε) (we will prove
this fact later in the course). The expansion here is the natural enlargement of a ball radius.
Random sampling from a weighted set. Another technicality is that the weights might be quite large. To overcome
this, we will store the weight of an element by storing an index i, such that the weight of the element is 2i , We still need to do m
independent draws from this weighted set. The easiest way to do that, is to compute the element e in of P in maximum weight,
and observing that all elements of weight ≤ we /n10 have weight which is so tiny, so that it can be ignored, where w p is the weight
of e. Thus, normalize all the weights of by dividing them by 2blg we /n c , and remove all elements with weights smaller than 1. For
10
ω(p) denote its normalized weight. Clearly, all the normalized weights are integers in the range 1, . . . , 2n10 . Thus,
a point p, let b
we now have to pick points for a set with (small) integer weights. Place the elements in an array, and compute the prefix sum
P
array of their weights. That is ak = ki=1 b
ω(pi ), for i = 1, . . . , n. Next, pick a random number γ uniformly in the range [0, an ], and
using a binary search, find the j, such that a j−1 ≤ γ < a j . This picks the points p j to be in the random sample. This requires O(n)
preprocessing, but a single random sample can now be done in O(log n) time. We need to perform r independent samples. Thus,
this takes O(n + r log n).
22.3.1 Correctness
Lemma 22.3.1 The algorithm described above computes a ε-shell set for P of size O(r) = O(kopt dimVC log (kopt dimVC )). The
algorithm performs O(4kopt ln n) iterations.
Proof: We only need to prove that the algorithm terminates in the claimed number of iterations. Observe, that with constant
probability (say ≥ 0.9), the sample Ri , in the ith iteration, is an δ-net for Pi−1 (the weighted version of P in the end of the (i −
178
1)th iteration),
in relation to the ranges of F. Observe, that this also implies that Ri is a δ-net for the complement family F =
d
IR \ r r ∈ F , with constant probability (since (IRd , F) and (IRd , F) have the same VC dimension).
If Ri is such a δ-net, then we know that range r we compute completely covers the set Ri , and as such, for any range r0 ∈ F that
avoids Ri we have w(r0 ) ≤ δw(Pi ). In particular, this implies that ω(S i ) ≤ δw(Pi−1 ). If not, than Ri is not a δ-net, and we resample.
The probability for that is ≤ 0.1. As such, we expect to repeat this O(1) times in each iteration, till we have w(S i ) ≤ δw(Pi−1 ).
Thus, in each iteration, the algorithm doubles the weight of at most a δ-fraction of the total point set. Thus w(Pi ) ≤ (1 +
δ)w(Pi−1 ) = n(1 + δ)i .
On the other hand, consider the smallest shell S of P, which is of size kopt . If all the elements of S are in Ri , then the algorithm
would have terminated, since S is a monotone shell set1. Thus, if we continue to the next iteration, it must be that |S ∩ S i | ≥ 1. In
particular, we are doubling the weight of at least one element of the shell set. We conclude that the weight of Pi in the ith iteration,
is at least
kopt 2i/kopt ,
since in every iteration at least one element of S gets its weight redoubled. Thus, we have
! !
i i
exp ≤ 2i/kopt ≤ kopt 2i/kopt ≤ (1 + δ)i n ≤ n · exp(δi) = n · exp .
2kopt 4kopt
Namely, exp 4kiopt ≤ n. Implying that i ≤ 4kopt ln n. Namely, after 4kopt ln n iterations the algorithm terminates, and thus returns the
required shell set and approximation.
Theorem 22.3.2 Under the settings of Section 22.2,one can compute a monotone ε-shell set for P of size O(kopt dimVC log(kopt dimVC )).
The running time of the resulting algorithm is O (n + T (kopt dimVC log(kopt dimVC )))kopt ln n , with high probability, for kopt ≤
n/ log3 n. Furthermore, one can compute an ε-approximation to f opt (P) in the same time bounds.
Proof: The algorithm is described above. The bounds on the running time follows from the bounds on the number of iterations
from Lemma 22.3.1. The only problem we need to address, is that the resampling would repeatedly fail, and the algorithm would
spend exuberant amount of time on resampling. However, the probability of failure in sampling is ≤ 0.1. Furthermore, we need at
most 4kopt log n good samples before the algorithm succeeds. It is now straightforward to show using Chernoff inequality, that with
high probability, we will perform at most 8kopt log n samplings before achieving the required number of good samples.
The natural algorithm for this problem is the greedy algorithm that repeatedly pick the set in the family F that covers the largest
number of uncovered elements in S . It is not hard to show that this provides a O(|S |) approximation. In fact, it is known that set
covering can be better approximated unless P = NP.
Assume, however, that we know that the VC dimension of the set system (S , F) has VC dimension dimVC . In fact, we need a
stronger fact, that the dual family
S = F, U(s, F) s ∈ S ,
is of low VC dimension dimVC , where U(s, F) = X s ∈ X, X ∈ F .
It turns out that the algorithm of Figure 22.1 also works in this setting. Indeed, we set the weight of the sets to 1, we pick
a random sample of sets. IF they cover the universe S , we are done. Otherwise, there must be a point p which is not covered.
Arguing as above, we know that the random sample is a δ-net of (the weighted) S, and as such all the sets containing p have total
weight ≤ δ(S). As such, double the weight of all the sets covering p, and repeat. Arguing as above, one can show that the algorithm
terminates after O(kopt log m) iterations, where m is the number of sets, where kopt is the number of sets in the optimal cover of S .
Furthermore, the size of the cover generated is O(kopt dimVC log(kopt dimVC )).
179
Theorem 22.3.3 Let (S , F) be a range space, such that the dual range space S has VC dimension dimVC . Then, one can compute a
set covering for S using sets of F using O(kopt dimVC log(kopt dimVC )) sets. This requires O(kopt log n) iterations, and takes polynomial
time.
Note, that we did not provide in Theorem 22.3.3 exact running time bounds. Usually in geometric settings, one can get
improved running time using the underlying geometry. Interestingly, the property that the dual system has low VC dimension
“buys” one a lot, as it implies that one can do O(log kopt ) approximation, instead of O(log n) in the general case.
Definition 22.5.1 Let P be a point set in IRd , 1/2 > ε > 0 a parameter.
For a cluster c, let c(δ) denote the cluster resulting form expanding c by δ. Thus, if c is a ball of radius r, then c(δ) is a ball of
radius r + δ. For a set C of clusters, let
C(δ) = c(δ) c ∈ C ,
be the additive expansion operator; that is, C(δ) is a set of clusters resulting form expanding each cluster of C by δ.
Similarly,
(1 + ε)C = (1 + ε)c c ∈ C ,
is the multiplicative expansion operator, where (1 + ε)c is the cluster resulting from expanding c by a factor of (1 + ε). Namely, if
C is a set of balls, then (1 + ε)C is a set of balls, where a ball c ∈ C, corresponds to a ball radius (1 + ε) radius(c) in (1 + ε)C.
A set S ⊆ P is an (additive) ε-coreset of P, in relation to a price function radius, if for any clustering C of S, we have that P is
covered by C(ε radius(C)), where radius(C) = maxc∈C radius(c). Namely, we expand every cluster in the clustering by an ε-fraction
of the size of the largest cluster in the clustering. Thus, if C is a set of k balls, then C(ε f (C)) is just the set of balls resulting from
expanding each ball by εr, where r is the radius of the largest ball.
A set S ⊆ P is a multiplicative ε-coreset of P, if for any clustering C of S, we have that P is covered by (1 + ε)C.
Lemma 22.5.2 Let P be a set of n points in IRd , and ε > 0 a parameter. There exists an additive ε-coreset for the k-center problem,
and this coreset has O(k/εd ) points.
Proof: Let C denote the optimal clustering of P. Cover each ball of C by a grid of side length εropt /d, where ropt is the radius
of the optimal k-center clustering of P. From each such grid cell, pick one points of P. Clearly, the resulting point set S is of size
O(k/εd ) and it is an additive coreset of P.
The following is a minor extension of an argument used in [APV02].
d
Lemma 22.5.3 Let P be a set of n points
in IR , and ε > 0 a parameter. There exists a multiplicative ε-coreset for the k-center
problem, and this coreset has O k!/εdk points.
Proof: For k = 1, the additive coreset of P is also a multiplicative coreset, and it is of size O(1/εd ).
As in the proof of Lemma 22.5.2, we cover the point set by a grid of radius εropt /(5d), let SQ the set of cells (i.e., cubes) of
this grid which contains points of P. Clearly, |SQ| = O(k/εd ).
Let S be the additive ε-coreset of P. Let C be any k-center clustering of S, and let ∆ be any cell of SQ.
If ∆ intersects all the k balls of C, then one of them must be of radius at least (1 − ε/2)rd(P, k). Let c be this ball. Clearly, when
we expand c by a factor of (1 + ε) it would completely cover ∆, and as such it would also cover all the points of ∆ ∩ P.
Thus, we can assume that ∆ intersects at most k − 1 balls of C. As such, we can inductively compute an ε-multiplicative coreset
S
of P ∩ ∆, for k − 1 balls. Let Q∆ be this set, and let Q = S ∪ ∆∈SQ Q∆ .
Note that |Q| = T (k, ε) = O(k/εd )T (k − 1, ε) + O(k/εd ) = O k!/εdk . The set Q is the required multiplicative coreset by the
above argumentation.
180
22.6 Union of Cylinders
Let assume we want to cover P by k cylinders of minimum maximum radius (i.e., fit the points to k lines). Formally, consider G
to be the set of all cylinders in IRd , and let F = c1 ∪ c2 ∪ . . . ∪ ck c1 , . . . , ck ∈ F be the set, which its members are union of k
cylinders. For C ∈ F, let f (C) = maxc∈C radius(c). Let f opt (P) = minC∈F,P⊆C f (C).
One can compute the optimal cover of P by k cylinders in O(n(2d−1)k+1 ) time, see below for details. Furthermore, (IRd , F) has
VC dimension dimVC = O(dk log(dk)). Finally, one can show that this set of cylinders has ε-coreset of small size ???. Thus, we
would like to compute a small ε-coreset, and compute an approximation quickly.
Theorem 22.6.1 Let P be a set of n points in IRd . There exists a (additive) ε-coreset for P of size O(k/εd−1+k ) for covering the
points by k-cylinders of minimum radius.
181
22.7 Bibliographical notes
Section 22.3.2 is due to Clarkson [Cla93]. This technique was used to approximate terrains [AD97], and covering polytopes
[Cla93].
The observation that this argument can be used to speedup approximation algorithms is due to Agarwal et al. [APV02]. The
discussion of shell sets is implicit in the work of Bădoiu et al. [BHI02].
Chapter 23
Duality
I don’t know why it should be, I am sure; but the sight of another man asleep in bed when I am up, maddens me. It
seems to me so shocking to see the precious hours of a man’s life - the priceless moments that will never come back
to him again - being wasted in mere brutish sleep.
– Jerome K. Jerome, Three Men in a Boat
Duality is a transformation that maps lines and points into points and lines, respectively, while preserving some properties in
the process. Despite its relative simplicity, it is a powerful tool that can dualize what seems like “hard” problems into easy dual
problems.
p = (a, b)  ⇒  p⋆ : y = ax − b
ℓ : y = cx + d  ⇒  ℓ⋆ = (c, −d).
We will consider a line ℓ ≡ y = cx + d to be a linear function in one dimension, and let ℓ(x) = cx + d.
A point p = (a, b) lies above a line ℓ ≡ y = cx + d if p lies vertically above ℓ. Formally, we have that b > ℓ(a) = ca + d. We
will denote this fact by p ≻ ℓ. Similarly, the point p lies below ℓ if b < ℓ(a) = ca + d, denoted by p ≺ ℓ.
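To make the transform concrete, the following Python sketch (mine, not from the text; the function names are illustrative) encodes the two maps and checks on random inputs that the above/below relation is preserved with the roles of the point and the line exchanged: p lies above ℓ exactly when ℓ⋆ lies above p⋆, since both conditions read b > ca + d.

```python
import random

def dual_of_point(p):
    """Point p = (a, b) maps to the line y = a*x - b, returned as (slope, intercept)."""
    a, b = p
    return (a, -b)

def dual_of_line(line):
    """Line y = c*x + d maps to the point (c, -d)."""
    c, d = line
    return (c, -d)

def above(p, line):
    """True if the point p lies (strictly) vertically above the line y = c*x + d."""
    a, b = p
    c, d = line
    return b > c * a + d

# Sanity check: p lies above l  if and only if  l's dual point lies above p's dual line.
for _ in range(1000):
    p = (random.uniform(-5, 5), random.uniform(-5, 5))
    l = (random.uniform(-5, 5), random.uniform(-5, 5))
    assert above(p, l) == above(dual_of_line(l), dual_of_point(p))
```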
A line ℓ supports a convex set S ⊆ IR^2 if it intersects S but the interior of S lies completely on one side of ℓ.
The missing lines. Consider the vertical line ℓ ≡ x = 0. Clearly, ℓ does not have a dual point (specifically, its hypothetical
dual point has an x coordinate with infinite value). In particular, our duality cannot handle vertical lines. To visualize the problem,
consider a sequence of non-vertical lines ℓ_i that converges to a vertical line ℓ. The sequence of dual points ℓ_i⋆ is a sequence of points
that diverges to infinity.
23.1.1 Examples
23.1.1.1 Segments and Wedges
Consider a segment s = pq that lies on a line ℓ. Observe that the dual of a point r ∈ ℓ is a line r⋆ that passes through the point ℓ⋆. In fact, the two lines p⋆ and q⋆ define two double wedges. Let W be the double wedge that does not contain the vertical line that passes through ℓ⋆.
[Figure: a segment s = pq on a line ℓ (primal) and the corresponding double wedge W bounded by p⋆ and q⋆ (dual).]
Consider now the point r as it moves along s. When it is equal to p, its dual line r⋆ is the line p⋆. Now, as r moves along the segment s, the dual line r⋆ rotates around ℓ⋆, till it arrives at q⋆ (and then r reaches q).
What about the other wedge? It represents the two rays forming ℓ \ s. The vertical line through ℓ⋆ represents the singularity
point at infinity where the two rays are “connected” together. Thus, as r travels along one of the rays (say, starting at q) of ℓ \ s,
the dual line r⋆ becomes steeper and steeper, till it becomes vertical. Now, the point r “jumps” from the “infinite endpoint” of this
ray to the “infinite endpoint” of the other ray. Simultaneously, the line r⋆ continues to rotate from its current vertical position,
sweeping over the whole wedge, till r travels back to p. (The reader who feels uncomfortable with notions like “infinite endpoint”
can rest assured that the author feels the same way. As such, this should be taken as an intuitive description of what is going on and
not a formally correct one.)
Lemma 23.1.1 Let L be a set of lines in the plane. Let α ∈ IR be any number, β− = L_L(α) and β+ = U_L(α). Let p = (α, β−) and
q = (α, β+). Then:
(i) The dual lines p⋆ and q⋆ are parallel, and they are both perpendicular to the direction (α, −1).
(ii) The lines p⋆ and q⋆ support CH(L⋆).
(iii) The extent E_L(α) is the vertical distance between the lines p⋆ and q⋆.
Proof: (i) We have p⋆ ≡ y = αx − β− and q⋆ ≡ y = αx − β+. These two lines are parallel since they have the same slope. In
particular, they are parallel to the direction (1, α). But this direction is perpendicular to the direction (α, −1).
(ii) By property (P2), we have that all the points of L⋆ are below (or on) the line p⋆. Furthermore, since p is on the lower
envelope of L, it follows that p⋆ must pass through one of the points of L⋆. Namely, p⋆ supports CH(L⋆) and it lies above it. A similar
argument applies to q⋆.
(iii) We have that E_L(α) = β+ − β−. The vertical distance between the two parallel lines p⋆ and q⋆ is p⋆(0) − q⋆(0) = −β− − (−β+) = β+ − β−, as required.
Thus, consider a vertex p of the upper envelope of the set of lines L. The point p is the intersection point of two lines ℓ and ℓ′ of L. Consider the dual set of points L⋆ and the dual line p⋆. Since p lies above (or on) all the lines of L, by the above discussion it must be that the line p⋆ lies below (or on) all the points of L⋆. On the other hand, the line p⋆ passes through the two points ℓ⋆ and ℓ′⋆. Namely, p⋆ is a line that supports the convex hull of L⋆ and it passes through two of its vertices.
The convex hull of L⋆ is a convex polygon P, which can be broken into two convex chains by breaking it at the two points extreme in the x direction. We will refer to the upper polygonal chain of the convex hull as the upper convex chain and to the other one as the lower convex chain. In particular, two consecutive segments of the upper envelope correspond to two consecutive vertices on the lower chain of the convex hull of L⋆. Thus, the convex hull of L⋆ can be decomposed into two chains. The lower chain corresponds to the upper envelope of L, and the upper chain corresponds to the lower envelope of L. Of special interest are the two x-extreme points p and q of the convex hull. They are the duals of the two lines with the largest/smallest slopes in L (we are assuming here that the slopes of the lines in L are distinct). These two lines appear on both the upper and lower envelopes of the lines, and they contain the four infinite rays of these envelopes.
[Figure: the convex hull of L⋆, broken at its x-extreme vertices p and q into the upper and lower convex chains.]
Lemma 23.1.2 Given a set L of n lines in the plane, one can compute its lower and upper envelopes in O(n log n) time.
Proof: One can compute the convex hull of n points in the plane in O(n log n) time. Thus, computing the convex hull of L⋆ and dualizing the upper and lower chains of CH(L⋆) results in the required envelopes.
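As an illustration of Lemma 23.1.2, here is a short Python sketch (my own, under the assumption that the input lines are non-vertical and have distinct slopes) that computes both envelopes by dualizing, computing the convex hull with Andrew's monotone chain, and dualizing the two chains back.

```python
def envelopes_via_duality(lines):
    """Compute the lower/upper envelopes of non-vertical lines y = c*x + d,
    given as (c, d) pairs with distinct slopes, via the convex hull of the
    dual points (c, -d).  A hypothetical helper, not code from the text.

    Returns (lower_env, upper_env): the lines on each envelope, by slope."""
    pts = sorted((c, -d) for c, d in lines)           # dual points, sorted by x

    def half_hull(points):                            # one chain of the hull
        chain = []
        for p in points:
            while len(chain) >= 2:
                (x1, y1), (x2, y2) = chain[-2], chain[-1]
                # pop while the triple does not make a left (counter-clockwise) turn
                if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) <= 0:
                    chain.pop()
                else:
                    break
            chain.append(p)
        return chain

    lower_chain = half_hull(pts)                      # lower chain of CH(L*)
    upper_chain = half_hull(pts[::-1])                # upper chain of CH(L*)
    # Per the discussion above: lower chain of CH(L*) <-> upper envelope of L,
    # upper chain of CH(L*) <-> lower envelope of L.
    upper_env = [(c, -y) for c, y in lower_chain]
    lower_env = [(c, -y) for c, y in reversed(upper_chain)]
    return lower_env, upper_env
```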
In the following we will slightly abuse notation, and for a point p ∈ IR^d we will refer to (p_1, ..., p_{d−1}, L_H(p)) as the point L_H(p). Similarly, U_H(p) denotes the corresponding point on the upper envelope of H.
The proof of the following lemma is an easy extension of the 2d case.
23.3 Exercises
Exercise 23.3.1 Prove Lemma 23.2.1
Exercise 23.3.2 Show a counter example proving that no duality can preserve (exactly) orthogonal distances between points and
lines.
23.4 Bibliographical notes
The duality discussed here should not be confused with linear programming duality [Van97]. Although the two topics seem to be
connected somehow, the author is unaware of a natural and easy connection.
A natural question is whether one can find a duality that preserves the orthogonal distances between lines and points. The
surprising answer is no, as Exercise 23.3.2 testifies. In fact, it is not too hard to show, using topological arguments, that any duality
must distort such distances arbitrarily badly [FH06].
Open Problem 23.4.1 Given a set P of n points in the plane, and a set L of n lines in the plane, consider the best possible duality
(i.e., the one that minimizes the distortion of orthogonal distances) for P and L. What is the best distortion possible, as a function
of n?
Here, we define the distortion of the duality as
max_{p∈P, ℓ∈L} max( d(p, ℓ)/d(p⋆, ℓ⋆), d(p⋆, ℓ⋆)/d(p, ℓ) ).
A striking (negative) example of the power of duality is the work of Overmars and van Leeuwen [OvL81] on the dynamic
maintenance of convex hulls in 2d and the maintenance of the lower/upper envelope of lines in the plane. Clearly, by duality, the
two problems are identical. However, the authors (smart people indeed) did not observe this, and the paper is twice as long as it
should be, solving the two problems separately.
Duality is heavily used throughout computational geometry, and it is hard to imagine managing without it. Results and
techniques that use duality include bounds on k-sets/k-levels [Dey98], partition trees [Mat92], and coresets for extent measure
[AHV04] (this is a random short list of relevant results and it is by no means exhaustive).
to the hyperboloid, to the computation of the convex hull of the points. In fact, the projection down of the lower part of the convex
hull is the required triangulation. Thus, the two structures are dual to each other via lifting/linearization as well as by direct duality. The
interested reader should check out [dBvKOS00].
Chapter 24
For example, IR^2 with the regular Euclidean distance is a metric space.
It is usually of interest to consider the finite case, where X is a set of n points. Then, the function d can be specified by \binom{n}{2} real numbers; that is, the distance between every pair of points of X. Alternatively, one can think about (X, d) as a weighted complete graph, where we specify positive weights on the edges, and the resulting weights on the edges comply with the triangle inequality.
In fact, finite metric spaces arise naturally from (sparser) graphs. Indeed, let G = (X, E) be an undirected weighted graph
defined over X, and let d_G(x, y) be the length of the shortest path between x and y in G. It is easy to verify that (X, d_G) is a finite
metric space. As such, if the graph G is sparse, it provides a compact representation of the finite metric space (X, d_G).
Definition 24.1.2 Let (X, d) be an n-point metric space. We denote the open ball of radius r about x ∈ X by b(x, r) = { y ∈ X | d(x, y) < r }.
Underlying our discussion of metric spaces are algorithmic applications. The hardness of various computational problems
depends heavily on the structure of the finite metric space. Thus, given a finite metric space and a computational task, it is natural
to try to map the given metric space into a new metric space where the task at hand becomes easy.
Example 24.1.3 Consider the problem of computing the diameter. While it is not trivial in two dimensions, it is easy in one
dimension. Thus, if we could map points in two dimensions into points in one dimension, such that the diameter is preserved,
then computing the diameter becomes easy. In fact, this approach yields an efficient approximation algorithm, see Exercise 24.7.3
below.
Of course, this mapping from one metric space to another is going to introduce error. We would be interested in minimizing
the error introduced by such a mapping.
Definition 24.1.4 Let (X, d_X) and (Y, d_Y) be metric spaces. A mapping f : X → Y is called an embedding, and is C-Lipschitz if
d_Y(f(x), f(y)) ≤ C · d_X(x, y) for all x, y ∈ X. The mapping f is called K-bi-Lipschitz if there exists a C > 0 such that
C K^{−1} · d_X(x, y) ≤ d_Y(f(x), f(y)) ≤ C · d_X(x, y),
for all x, y ∈ X.
The least K for which f is K-bi-Lipschitz is called the distortion of f, and is denoted dist(f). The least distortion with which
X may be embedded in Y is denoted c_Y(X).
There are several powerful results in this vein that show the existence of embeddings with low distortion. These include:
(A) Probabilistic trees. Every finite metric can be randomly embedded into a tree such that the “expected” distortion for a specific
pair of points is O(log n).
(B) Embedding into Euclidean space. Any n-point metric space can be embedded into (finite dimensional) Euclidean space with
O(log n) distortion.
(C) Dimension reduction. Any n-point set in Euclidean space with the regular Euclidean distance can be embedded into IRk with
distortion (1 + ε), where k = O(ε−2 log n).
24.2 Examples
What is distortion? When considering a mapping f : X → IRd of a metric space (X, d) to IRd , it would useful to observe
that since IRd can be scaled, we can consider f to be an expansive mapping (i.e., no distances shrink). Furthermore, we can in fact
kx−yk
assume that there is at least one pair of points x, y ∈ X, such that d(x, y) = kx − yk. As such, we have dist( f ) = max x,y d(x,y) .
Why is distortion necessary? Consider the graph G = (V, E) with one vertex s connected to three other vertices a, b, c, where the weights on the edges are all one (i.e., G is the star graph with three leaves). We claim that G cannot be embedded into Euclidean space with distortion smaller than 2/√3. Indeed, consider the associated metric space (V, d_G) and an (expansive) embedding f : V → IR^d.
[Figure: the star graph with center s and leaves a, b, c.]
Let △ denote the triangle formed by a′b′c′, where a′ = f(a), b′ = f(b) and c′ = f(c). Next, consider the quantity max(‖a′ − s′‖, ‖b′ − s′‖, ‖c′ − s′‖), which lower bounds the distortion of f. This quantity is minimized when r = ‖a′ − s′‖ = ‖b′ − s′‖ = ‖c′ − s′‖. Namely, s′ is the center of the smallest enclosing circle of △. However, r is minimized when all the edges of △ are of equal length, and are in fact of length d_G(a, b) = 2. It follows that dist(f) ≥ r = 2/√3.
It is known that Ω(log n) distortion is necessary in the worst case when embedding a graph into Euclidean space. This is shown
using expanders [Mat02].
Definition 24.2.1 A hierarchically well-separated tree (HST) is a metric space defined on the leaves of a rooted tree T. To each
vertex u ∈ T there is associated a label ∆_u ≥ 0. This label is zero for all the leaves of T, and it is a positive number for all the
interior nodes. The labels are such that if a vertex u is a child of a vertex v then ∆_u ≤ ∆_v. The distance between two leaves x, y ∈ T
is defined as ∆_{lca(x,y)}, where lca(x, y) is the least common ancestor of x and y in T.
An HST T is a k-HST if for every vertex v ∈ T we have that ∆_v ≤ ∆_{p(v)}/k, where p(v) is the parent of v in T.
Note that an HST is a very limited metric. For example, consider the cycle G = C_n on n vertices, with weight one on each edge,
and consider an expansive embedding f of G into an HST H. It is easy to verify that there must be two consecutive vertices of the
cycle that are mapped to two different subtrees of the root r of H. Since f is expansive, it follows that ∆_r ≥ n/2. As such,
dist(f) ≥ n/2. Namely, HSTs fail to faithfully represent even very simple metrics.
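The HST metric of Definition 24.2.1 is simple to realize in code. The following toy Python sketch (not from the text; the class and function names are mine) stores the labels ∆_u on the nodes and evaluates the distance between two leaves as the label of their least common ancestor.

```python
class HSTNode:
    """A node of a hierarchically well-separated tree.  Interior nodes carry a
    label delta >= 0; leaves have delta = 0."""
    def __init__(self, delta=0.0, children=(), name=None):
        self.delta = delta
        self.children = list(children)
        self.name = name
        self.parent = None
        for c in self.children:
            c.parent = self

def hst_distance(x, y):
    """Distance between two leaves of an HST: the label of their lowest common
    ancestor, found by walking up from both leaves."""
    ancestors_of_x = []
    node = x
    while node is not None:
        ancestors_of_x.append(node)
        node = node.parent
    seen = set(id(a) for a in ancestors_of_x)
    node = y
    while id(node) not in seen:
        node = node.parent
    return node.delta                     # Delta of lca(x, y)

# A 2-HST on four leaves: the root has label 4, its two children have label 2.
a, b, c, d = (HSTNode(name=ch) for ch in "abcd")
left, right = HSTNode(2.0, [a, b]), HSTNode(2.0, [c, d])
root = HSTNode(4.0, [left, right])
assert hst_distance(a, b) == 2.0 and hst_distance(a, c) == 4.0
```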
24.2.2 Clustering
One natural problem we might want to solve on a graph (i.e., finite metric space) (X, d) is to partition it into clusters. One such
natural clustering is k-median clustering, where we would like to choose a set C ⊆ X of k centers such that
ν_C(X, d) = Σ_{q∈X} d(q, C)
is minimized, where d(q, C) = min_{c∈C} d(q, c) is the distance of q to its closest center in C.
It is known that finding the optimal k-median clustering in a (general weighted) graph is NP-complete. As such, the best we
can hope for is an approximation algorithm. However, if the structure of the finite metric space (X, d) is simple, then the problem
can be solved efficiently. For example, if the points of X are on the real line (and the distance between a and b is just |a − b|), then
k-median can be solved using dynamic programming.
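For concreteness, here is a simple Python sketch of the dynamic program alluded to above for k-median on the real line. It is my own illustration (a deliberately naive, roughly cubic-time version), not an algorithm given in the text; it uses the fact that in an optimal solution each median serves a contiguous run of the sorted points.

```python
def k_median_on_line(xs, k):
    """Optimal k-median cost for points xs on the real line, 1 <= k <= len(xs).

    cost[i][j] is the cost of serving xs[i..j] with a single median, which can
    always be taken to be the middle point of the sorted run; best[m][j] is the
    optimal cost of covering the first j points with m medians."""
    xs = sorted(xs)
    n = len(xs)

    cost = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            med = xs[(i + j) // 2]                    # a median of xs[i..j]
            cost[i][j] = sum(abs(x - med) for x in xs[i:j + 1])

    INF = float("inf")
    best = [[INF] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(j):                        # last cluster is xs[i..j-1]
                cand = best[m - 1][i] + cost[i][j - 1]
                if cand < best[m][j]:
                    best[m][j] = cand
    return best[k][n]
```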
Another interesting case is when the metric space (X, d) is an HST. It is not too hard to prove the following lemma; see Exercise 24.7.1.
Lemma 24.2.2 Let (X, d) be a HST defined over n points, and let k > 0 be an integer. One can compute the optimal k-median
clustering of X in O(k2 n) time.
Thus, if we can embed a general graph G into an HST H with low distortion, then we can approximate the k-median clustering
of G by clustering the resulting HST and “importing” the resulting partition to the original space. The quality of approximation
is bounded by the distortion of the embedding of G into H.
24.3 Random Partitions
Let (X, d) be a finite metric space. Given a partition P = {C1 , . . . , Cm } of X, we refer to the sets Ci as clusters. We write PX for the
set of all partitions of X. For x ∈ X and a partition P ∈ PX we denote by P(x) the unique cluster of P containing x. Finally, the set
of all probability distributions on PX is denoted DX .
What we want, and what we can get. The target is to partition the metric space into clusters, such that each cluster
would have diameter at most ∆, for some prespecified parameter ∆.
We would like to have a partition that does not disrupt distances too “badly”. Intuitively, that means that a pair of points at distance larger than ∆ will be separated by the clustering, but points that are closer to each other will be in the same cluster.
This is of course impossible, as any clustering must separate points that are close to each other. To see that, consider a set of points
densely packed on the interval [0, 10], and let ∆ < 5. Clearly, there would always be two close points that would be in two separate
clusters.
As such, our strategy would be to use partitions that are constructed randomly, and the best we can hope for, is that the
probability of them being separated is a function of their distance t, which would be small if t is small. As an example, for the
case of points on the real line, take the natural partition into intervals of length ∆ (that is, all the points in the interval [i∆, (i + 1)∆)
would belong to the same cluster), and randomly shift it by a random number x picked uniformly in [0, ∆). Namely, all the points
belonging to [x + i∆, x + (i + 1)∆) would belong to the same cluster. Now, it is easy to verify that for any two points p, q ∈ IR, of
distance t = |p − q| from each other, the probability that they are in two different intervals is bounded by t/∆ (see Exercise 24.7.4).
And intuitively, this is the best one can hope for.
As such, the clustering scheme we seek should separate two points at distance t from each other with probability (t/∆) · noise,
where noise is hopefully small.
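The randomly shifted partition of the line described above can be simulated directly. The following Python toy (mine, not from the text) estimates the separation probability empirically; for t = |p − q| ≤ ∆ it works out to exactly t/∆ (Exercise 24.7.4(A) asks for the upper bound).

```python
import math
import random

def cluster_index(p, delta, shift):
    """Index of the interval [shift + i*delta, shift + (i+1)*delta) containing p."""
    return math.floor((p - shift) / delta)

def separation_probability(p, q, delta, trials=100_000):
    """Empirically estimate the probability that the randomly shifted partition
    of the real line separates p and q."""
    sep = 0
    for _ in range(trials):
        shift = random.uniform(0.0, delta)
        if cluster_index(p, delta, shift) != cluster_index(q, delta, shift):
            sep += 1
    return sep / trials

# With |p - q| = 0.3 and delta = 1.0 the estimate should be close to 0.3.
print(separation_probability(0.0, 0.3, 1.0))
```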
24.3.2 Properties
Lemma 24.3.1 Let (X, d) be a finite metric space, ∆ = 2^u a prescribed parameter, and let P be the partition of X generated by the
above random partition. Then the following holds:
(i) For any C ∈ P, we have diam(C) ≤ ∆.
(ii) Let x be any point of X, and t a parameter ≤ ∆/8. For B = b(x, t), we have that
Pr[B ⊄ P(x)] ≤ (8t/∆) ln(M/m),
where M = |b(x, ∆)| and m = |b(x, ∆/8)|.
Proof: Since Cy ⊆ b(y, R), we have that diam(Cy ) ≤ ∆, and thus the first claim holds.
Let U be the set of points w ∈ b(x, ∆) such that b(w, R) ∩ B ≠ ∅. Arrange the points of U in increasing distance from x, and
let w_1, ..., w_{M′} denote the resulting order, where M′ = |U|. For k = 1, ..., M′, let I_k = [d(x, w_k) − t, d(x, w_k) + t] and write E_k for
the event that w_k is the first point in π such that B ∩ C_{w_k} ≠ ∅, and yet B ⊄ C_{w_k}. Observe that if B ⊄ P(x) then one of the events
E_1, ..., E_{M′} must occur.
Note that if wk ∈ b(x, ∆/8), then Pr[Ek ] = 0 since t ≤ ∆/8 and B = b(x, t) ⊆ b(x, ∆/8) ⊆ b(wk , ∆/4) ⊆ b(wk , R). Indeed, when
we “scoop” out the cluster Cwk , either B would be fully contained inside Cwk , or alternatively, if B is not fully contained inside Cwk ,
then some parts of B were already “scooped out” by some other point of U, and as such Ek does not happen.
In particular, w_1, ..., w_m are inside b(x, ∆/8), and as such Pr[E_1] = · · · = Pr[E_m] = 0. Also, note that if d(x, w_k) < R − t then
b(w_k, R) contains B and as such E_k cannot happen. Similarly, if d(x, w_k) > R + t then b(w_k, R) ∩ B = ∅ and E_k cannot happen. As
such, if E_k happens then R − t ≤ d(x, w_k) ≤ R + t. Namely, if E_k happens then R ∈ I_k. We conclude that
Pr[Ek ] = Pr[Ek ∩ (R ∈ Ik )] = Pr[R ∈ Ik ] · Pr[Ek | R ∈ Ik ] .
Now, R is uniformly distributed in the interval [∆/4, ∆/2], and Ik is an interval of length 2t. Thus, Pr[R ∈ Ik ] ≤ 2t/(∆/4) = 8t/∆.
Next, to bound Pr[Ek | R ∈ Ik ], we observe that w1 , . . . , wk−1 are closer to x than wk and their distance to b(x, t) is smaller than
R. Thus, if any of them appear before wk in π then Ek does not happen. Thus, Pr[Ek | R ∈ Ik ] is bounded by the probability that wk
is the first to appear in π out of w1 , . . . , wk . But this probability is 1/k, and thus Pr[Ek | R ∈ Ik ] ≤ 1/k.
We are now ready for the kill. Indeed,
Pr[B ⊄ P(x)] ≤ Σ_{k=1}^{M′} Pr[E_k] = Σ_{k=m+1}^{M′} Pr[E_k] = Σ_{k=m+1}^{M′} Pr[R ∈ I_k] · Pr[E_k | R ∈ I_k]
≤ Σ_{k=m+1}^{M′} (8t/∆) · (1/k) ≤ (8t/∆) ln(M′/m) ≤ (8t/∆) ln(M/m),
since Σ_{k=m+1}^{M′} 1/k ≤ ∫_m^{M′} dx/x = ln(M′/m) and M′ ≤ M.
Theorem 24.4.1 Given an n-point metric space (X, d), one can randomly embed it into a 2-HST with probabilistic distortion ≤ 24 ln n.
Proof: The construction is recursive. Let ∆ = diam(X), and compute a random partition of X with cluster diameter ∆/2,
using the construction of Section 24.3.1. We recursively construct a 2-HST for each cluster, and hang the resulting trees on the
root node v, which is labeled by ∆_v = ∆. Clearly, the resulting tree is a 2-HST.
For a node v ∈ T , let X(v) be the set of points of X contained in the subtree of v.
For the analysis, assume diam(X) = 1, and consider two points x, y ∈ X. We consider a node v ∈ T to be at level i if
level(v) = ⌈lg ∆_v⌉ = i. The two points x and y correspond to two leaves in T; let û be the least common ancestor of x and y in T.
We have d_T(x, y) ≤ 2^{level(û)}. Furthermore, note that along a path from a leaf to the root the levels are strictly monotonically increasing.
In fact, we will be conservative, and let w be the first ancestor of x such that b = b(x, d(x, y)) is not completely
contained in X(u_1), ..., X(u_m), where u_1, ..., u_m are the children of w. Clearly, level(w) ≥ level(û). Thus, d_T(x, y) ≤ 2^{level(w)}.
Consider the path σ from the root of T to x, and let E_i be the event that b is not fully contained in X(v_i), where v_i is the node
of σ at level i (if such a node exists). Furthermore, let Y_i be the indicator variable that is 1 if E_i is the first to happen out of the
sequence of events E_0, E_{−1}, .... Clearly, d_T(x, y) ≤ Σ_i Y_i 2^i.
Let t = d(x, y), j = ⌊lg d(x, y)⌋, and n_i = |b(x, 2^i)| for i = 0, −1, −2, .... We have
E[d_T(x, y)] ≤ Σ_{i=j}^{0} E[Y_i] 2^i ≤ Σ_{i=j}^{0} 2^i Pr[E_i] ≤ Σ_{i=j}^{0} 2^i · (8t/2^i) ln(n_i/n_{i−3}) = 8t Σ_{i=j}^{0} ln(n_i/n_{i−3}) ≤ 8t · 3 ln n = 24 d(x, y) ln n,
since the last sum telescopes (separately over the indices i, i − 3, i − 6, ...) to at most 3 ln n_0 ≤ 3 ln n. This is exactly the claimed bound on the probabilistic distortion.
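A direct rendering of the recursive construction in this proof is given below; it is a sketch of mine, not code from the text. It expects a partition routine with the interface of the random-partition sketch above (one that returns clusters of diameter at most the requested ∆) and assumes the input points are distinct.

```python
def build_2hst(points, dist, partition):
    """Recursive 2-HST construction: partition with cluster diameter diam/2,
    recurse on every cluster, and hang the subtrees under a root labeled with
    the diameter.  Returns a nested tuple (delta_v, children); a leaf is
    represented as (0.0, point).

    partition(points, dist, delta) must return a dict point -> cluster label,
    with every cluster of diameter at most delta (e.g. random_partition above)."""
    if len(points) == 1:
        return (0.0, points[0])
    diam = max(dist(p, q) for p in points for q in points)
    cluster_of = partition(points, dist, diam / 2.0)
    clusters = {}
    for p in points:
        clusters.setdefault(cluster_of[p], []).append(p)
    children = [build_2hst(c, dist, partition) for c in clusters.values()]
    return (diam, children)
```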
Theorem 24.4.2 Let (X, d) be an n-point metric space. One can compute in polynomial time a k-median clustering of X which has
expected price O(α log n), where α is the price of the optimal k-median clustering of (X, d).
Proof: The algorithm is described above, and the fact that its running time is polynomial can easily be verified. To prove
the bound on the quality of the clustering, for any point p ∈ X, let c(p) denote the closest point of C_opt to p according to d, where
C_opt is the set of k medians in the optimal clustering. Let C be the set of k medians returned by the algorithm, and let H be the HST
used by the algorithm. We have
β = ν_C(X, d) ≤ ν_C(X, d_H) ≤ ν_{C_opt}(X, d_H) = Σ_{p∈X} d_H(p, C_opt) ≤ Σ_{p∈X} d_H(p, c(p)).
Proof: Indeed, let x′ and y′ be the closest points of Y to x and y, respectively. Observe that f(x) = d(x, x′) ≤ d(x, y′) ≤
d(x, y) + d(y, y′) = d(x, y) + f(y), by the triangle inequality. Thus, f(x) − f(y) ≤ d(x, y). By symmetry, we have f(y) − f(x) ≤ d(x, y).
Thus, |f(x) − f(y)| ≤ d(x, y).
Theorem 24.5.2 Given an n-point metric space Y = (X, d) with spread Φ, one can embed it into Euclidean space IR^k with distortion
O(√(ln Φ) · ln n), where k = O(ln Φ ln n).
Proof: Assume that diam(Y) = Φ (i.e., the smallest distance in Y is 1), and let r_i = 2^{i−2}, for i = 1, ..., α, where α = ⌈lg Φ⌉. Let
P_{i,j} be a random partition of X with diameter r_i, using Theorem 24.4.1, for i = 1, ..., α and j = 1, ..., β, where β = c log n and c
is a large enough constant, to be determined shortly.
For each cluster of P_{i,j} randomly toss a coin, and let V_{i,j} be all the points of X that belong to clusters in P_{i,j} that got 'T' in
their coin toss. For a point x ∈ X, let f_{i,j}(x) = d(x, X \ V_{i,j}) = min_{v∈X\V_{i,j}} d(x, v), for i = 0, ..., m and j = 1, ..., β. Let F : X →
IR^{(m+1)·β} be the embedding such that F(x) = ( f_{0,1}(x), f_{0,2}(x), ..., f_{0,β}(x), f_{1,1}(x), f_{1,2}(x), ..., f_{1,β}(x), ..., f_{m,1}(x), f_{m,2}(x), ..., f_{m,β}(x) ).
Next, consider two points x, y ∈ X with distance φ = d(x, y). Let u be an integer such that r_u ≤ φ/2 ≤ r_{u+1}. Clearly, in any of the
partitions P_{u,1}, ..., P_{u,β} the points x and y belong to different clusters. Furthermore, with probability half, x ∈ V_{u,j} and y ∉ V_{u,j}, or
x ∉ V_{u,j} and y ∈ V_{u,j}, for each 1 ≤ j ≤ β.
Let E_j denote the event that b(x, ρ) ⊆ V_{u,j} and y ∉ V_{u,j}, for j = 1, ..., β, where ρ = φ/(64 ln n). By Lemma 24.3.1, we have
Pr[b(x, ρ) ⊄ P_{u,j}(x)] ≤ (8ρ/r_u) ln n ≤ φ/(8 r_u) ≤ 1/2.
Thus,
Pr[E_j] ≥ Pr[ (b(x, ρ) ⊆ P_{u,j}(x)) ∩ (x ∈ V_{u,j}) ∩ (y ∉ V_{u,j}) ]
= Pr[b(x, ρ) ⊆ P_{u,j}(x)] · Pr[x ∈ V_{u,j}] · Pr[y ∉ V_{u,j}] ≥ 1/8,
since those three events are independent. Notice that if E_j happens, then f_{u,j}(x) ≥ ρ and f_{u,j}(y) = 0.
Let X_j be an indicator variable that is 1 if E_j happens, for j = 1, ..., β. Let Z = Σ_j X_j, and we have µ = E[Z] = E[Σ_j X_j] ≥
β/8. Thus, the probability that only β/16 of the events E_1, ..., E_β happen is at most Pr[Z < (1 − 1/2) E[Z]]. By the Chernoff inequality,
Pr[Z < (1 − 1/2) E[Z]] ≤ exp(−µ(1/2)²/2) = exp(−µ/8) ≤ exp(−β/64) ≤ 1/n^{10}, if we set c = 640.
Thus, with high probability,
‖F(x) − F(y)‖ ≥ √( Σ_{j=1}^{β} ( f_{u,j}(x) − f_{u,j}(y) )² ) ≥ √( (β/16) ρ² ) = ρ√β/4 = φ · √β/(256 ln n).
On the other hand, |f_{i,j}(x) − f_{i,j}(y)| ≤ d(x, y) = φ ≤ 64ρ ln n. Thus,
‖F(x) − F(y)‖ ≤ √( αβ (64ρ ln n)² ) = 64 √(αβ) ρ ln n = √(αβ) · φ.
Thus, setting G(x) = F(x) · (256 ln n)/√β, we get a mapping that maps two points at distance φ from each other to two points with
distance in the range [φ, φ · √α · 256 ln n]. Namely, G(·) is an embedding with distortion O(√α ln n) = O(√(ln Φ) · ln n).
The probability that G fails on one of the pairs is smaller than (1/n^{10}) · n² < 1/n^8. In particular, we can check the distortion
of G for all n² pairs, and if any of them fail (i.e., the distortion is too large), we restart the process.
Thus, ‖F(x) − F(y)‖ ≤ φ √(5β lg n). We conclude that, with high probability, F(·) is an embedding of X into Euclidean space with
distortion ( φ √(5β lg n) ) / ( φ √β/(256 ln n) ) = O(log^{3/2} n).
Indeed, if f_{i,j}(x) < d_i(x, V_{i,j}) and f_{i,j}(y) < d_i(y, V_{i,j}), then f_{i,j}(x) = 2r_i and f_{i,j}(y) = 2r_i, which implies the above inequality. If
f_{i,j}(x) = d_i(x, V_{i,j}) and f_{i,j}(y) = d_i(y, V_{i,j}), then the inequality trivially holds. The other option is handled in a similar fashion.
We still have to handle the problem of the infinite number of coordinates. However, the above proof shows that we care about a
resolution r_i (i.e., it contributes to the estimates in the above proof) only if there is a pair x and y such that r_i/n² ≤ d(x, y) ≤ r_i n².
Thus, for every pair of distances there are only O(log n) relevant resolutions. Thus, there are at most η = O(n² β log n) = O(n² log² n)
relevant coordinates, and we can ignore all the other coordinates. Next, consider the affine subspace h that spans F(P). Clearly, it
is (n − 1)-dimensional. Consider the projection G : IR^η → IR^{n−1} that maps a point to its closest point in h. Clearly, G(F(·)) is an
embedding with the same distortion for P, and the target space is of dimension n − 1.
Note that this whole process succeeds with high probability. If it fails, we try again. We conclude:
Theorem 24.5.3 (Low quality Bourgain theorem.) Given an n-point metric space M, one can embed it into Euclidean space of dimension n − 1, such that the distortion of the embedding is at most O(log^{3/2} n).
Using the Johnson-Lindenstrauss lemma, the dimension can be further reduced to O(log n). In fact, being more careful in the
proof, it is possible to reduce the dimension to O(log n) directly.
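For intuition, the embedding used in Theorems 24.5.2 and 24.5.3 can be sketched as follows. This is my own toy rendering, with deliberately small parameters; the proof takes β = c log n with c = 640, and it handles the degenerate coordinates (when no cluster is discarded) more carefully than the crude cap used here. The `partition` argument is expected to behave like the random-partition sketch given earlier.

```python
import math
import random

def bourgain_style_embedding(points, dist, partition, beta=16):
    """For each resolution r_i = 2^{i-2} (in units of the minimum interpoint
    distance) and each of beta repetitions, build a random partition with
    diameter r_i, keep every cluster independently with probability 1/2 (the
    set V_ij), and record the coordinate f_ij(x) = d(x, X \\ V_ij).
    Returns a dict mapping each point to its list of coordinates."""
    pairs = [(p, q) for p in points for q in points if p != q]
    min_d = min(dist(p, q) for p, q in pairs)
    diam = max(dist(p, q) for p, q in pairs)
    alpha = max(1, math.ceil(math.log2(diam / min_d)))     # lg of the spread
    coords = {p: [] for p in points}
    for i in range(1, alpha + 1):
        r_i = min_d * 2.0 ** (i - 2)
        for _ in range(beta):
            cluster_of = partition(points, dist, r_i)
            kept = {c for c in set(cluster_of.values()) if random.random() < 0.5}
            V = {p for p in points if cluster_of[p] in kept}
            outside = [p for p in points if p not in V]
            for x in points:
                # d(x, X \ V); if V = X we simply cap the coordinate at the diameter.
                coords[x].append(min((dist(x, q) for q in outside), default=diam))
    return coords
```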
24.7 Exercises
Exercise 24.7.1 (Clustering for HST.) [4 Points]
Let (X, d) be a HST defined over n points, and let k > 0 be an integer. Provide an algorithm that computes the optimal k-median
clustering of X in O(k2 n) time.
[Hint: Transform the HST into a tree where every node has only two children. Next, run a dynamic programming algorithm
on this tree.]
(A) [2 Points] Give a counter example to the following claim: Let (X, d) be a metric space, and let P be a partition of X. Then,
the pair (P, d0 ) is a metric, where d0 (C, C 0 ) = d(C, C 0 ) = min x∈C,y∈C 0 d(x, y) and C, C 0 ∈ P.
(B) [8 Points] Let (X, d) be an n-point metric space, and consider the set U = { i | 2^i ≤ d(x, y) ≤ 2^{i+1} for some x, y ∈ X }. Prove that
|U| = O(n). Namely, there are only n different resolutions that “matter” for a finite metric space.
(A) [1 Points] Let ` be a line in the plane, and consider the embedding f : IR2 → `, which is the projection of the plane into `.
Prove that f is 1-Lipschitz, but it is not K-bi-Lipschitz for any constant K.
(B) [3 Points] Prove that one can find a family of projections F of size O(1/√ε), such that for any two points x, y ∈ IR², for one of
the projections f ∈ F we have d(f(x), f(y)) ≥ (1 − ε)d(x, y).
(C) [1 Points] Given a set P of n points in the plane, give an O(n/√ε) time algorithm that outputs two points x, y ∈ P, such that
d(x, y) ≥ (1 − ε)diam(P), where diam(P) = max_{z,w∈P} d(z, w) is the diameter of P.
(D) [2 Points] Given P, show how to extract, in O(n) time, a set Q ⊆ P of size O(ε−2 ), such that diam(Q) ≥ (1 − ε/2)diam(P).
(Hint: Construct a grid of appropriate resolution.)
In particular, give an (1 − ε)-approximation algorithm to the diameter of P that works in O(n + ε−2.5 ) time. (There are slightly
faster approximation algorithms known for approximating the diameter.)
(A) [1 Points] For a real number ∆ > 0 and a random number x ∈ [0, ∆], consider the random partition of the real line into
intervals of length ∆, such that all the points falling into the interval [x + i∆, x + (i + 1)∆) are in the same cluster. Prove that for two
points p, q ∈ IR, the probability that p and q are in two different clusters is at most |p − q|/∆.
(B) [3 Points] Consider the d-dimensional grid of side length ∆, and let p be a random vector in the hypercube [0, ∆]^d. Shift the
grid by p, and consider the partition of IR^d induced by this grid. Formally, the space is partitioned into clusters, where for each
j ∈ Z^d the points inside the cube p + ∆j + [0, ∆)^d form one cluster. Consider any two points q, r ∈ IR^d. Prove that the probability that q and r are
in different clusters is bounded by d ‖q − r‖ /∆.
(C) [6 Points] Strengthen (B) by showing that the probability is bounded by √d ‖q − r‖ /∆. [Hint: Consider the distance t =
‖q − r‖ to be fixed, and figure out what the worst case for this partition is.]
Part (C) implies that we can partition space into clusters of diameter ∆′ = √d · ∆ such that the probability of points at distance
t from each other being separated is bounded by d t/∆′.
Acknowledgments
The presentation in this write-up follows closely the insightful suggestions of Manor Mendel.
Chapter 25
Tail Inequalities
"Wir müssen wissen, wir werden wissen" (We must know, we shall know)
— David Hilbert
E[X] ≥ 0 + t_0 · Pr[X ≥ t_0] > 0 + t_0 · (E[X]/t_0) = E[X],
a contradiction.
Theorem 25.1.2 (Chebychev inequality) Let X be a random variable with µ_X = E[X] and let σ_X be the standard deviation of X; that
is, σ_X² = E[(X − µ_X)²]. Then, Pr[|X − µ_X| ≥ tσ_X] ≤ 1/t².
Proof: Note that
Pr[|X − µ_X| ≥ tσ_X] = Pr[(X − µ_X)² ≥ t² σ_X²].
Set Y = (X − µ_X)². Clearly, E[Y] = σ_X². Now, apply the Markov inequality to Y.
Proof: Clearly, for an arbitrary t > 0, to be specified shortly, we have
Pr[Y ≥ ∆] = Pr[exp(tY) ≥ exp(t∆)] ≤ E[exp(tY)]/exp(t∆),
where the first equality follows from the fact that exp(·) preserves ordering, and the inequality follows from the Markov inequality.
Observe that
E[exp(tX_i)] = (1/2)e^t + (1/2)e^{−t} = (e^t + e^{−t})/2
= (1/2)(1 + t/1! + t²/2! + t³/3! + ⋯) + (1/2)(1 − t/1! + t²/2! − t³/3! + ⋯)
= 1 + t²/2! + ⋯ + t^{2k}/(2k)! + ⋯,
by the Taylor expansion of exp(·). Note that (2k)! ≥ (k!) 2^k, and thus
E[exp(tX_i)] = Σ_{i=0}^{∞} t^{2i}/(2i)! ≤ Σ_{i=0}^{∞} t^{2i}/(2^i (i!)) = Σ_{i=0}^{∞} (1/i!)(t²/2)^i = exp(t²/2),
again by the Taylor expansion of exp(·). Next, by the independence of the X_i's, we have
E[exp(tY)] = E[exp(Σ_i tX_i)] = E[Π_i exp(tX_i)] = Π_{i=1}^{n} E[exp(tX_i)] ≤ Π_{i=1}^{n} e^{t²/2} = e^{nt²/2}.
We have
Pr[Y ≥ ∆] ≤ exp(nt²/2)/exp(t∆) = exp(nt²/2 − t∆).
Next, minimizing the above quantity as a function of t, we set t = ∆/n. We conclude that
Pr[Y ≥ ∆] ≤ exp( (n/2)(∆/n)² − (∆/n)∆ ) = exp(−∆²/(2n)).
Corollary 25.2.2 Let X_1, ..., X_n be n independent random variables, such that Pr[X_i = 1] = Pr[X_i = −1] = 1/2, for i = 1, ..., n. Let
Y = Σ_{i=1}^{n} X_i. Then, for any ∆ > 0, we have
Pr[|Y| ≥ ∆] ≤ 2e^{−∆²/2n}.
Corollary 25.2.3 Let X_1, ..., X_n be n independent coin flips, such that Pr[X_i = 0] = Pr[X_i = 1] = 1/2, for i = 1, ..., n. Let
Y = Σ_{i=1}^{n} X_i. Then, for any ∆ > 0, we have
Pr[|Y − n/2| ≥ ∆] ≤ 2e^{−2∆²/n}.
Remark 25.2.4 Before going any further, it might be instructive to understand what these inequalities imply. Consider the
case where X_i is either zero or one with probability half. In this case µ = E[Y] = n/2. Set ∆ = t√n (note that √µ is approximately the
standard deviation of Y if p_i = 1/2). We have
Pr[|Y − n/2| ≥ ∆] ≤ 2 exp(−2∆²/n) = 2 exp(−2(t√n)²/n) = 2 exp(−2t²).
Thus, the Chernoff inequality implies exponential decay (i.e., ≤ 2^{−t}) with t standard deviations, instead of just the polynomial decay
implied by Chebychev's inequality.
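A quick experiment (mine, not from the text) makes this remark tangible: it compares the empirical tail of a sum of fair coin flips with the bound of Corollary 25.2.3.

```python
import math
import random

def empirical_tail(n, delta, trials=20_000):
    """Estimate Pr[|Y - n/2| >= delta] for Y a sum of n fair coin flips, and
    return it together with the bound 2*exp(-2*delta^2/n) of Corollary 25.2.3."""
    hits = 0
    for _ in range(trials):
        y = sum(random.randint(0, 1) for _ in range(n))
        if abs(y - n / 2.0) >= delta:
            hits += 1
    return hits / trials, 2.0 * math.exp(-2.0 * delta ** 2 / n)

# With n = 100 and a deviation of two standard deviations (delta = 10), the
# empirical tail is a few percent, while the bound evaluates to 2e^{-2} ~ 0.27.
print(empirical_tail(100, 10))
```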
25.2.2 The Chernoff Bound — General Case
Here we present the Chernoff bound in a more general setting.
Example 25.2.8 The Arkansas Aardvarks win each game with probability 1/3. What is the probability that they have a winning season in a season with n games? By the Chernoff inequality, this probability is smaller than
F⁺(n/3, 1/2) = [ e^{1/2} / 1.5^{1.5} ]^{n/3} = (0.89745)^{n/3} = 0.964577^n.
For n = 40, this probability is smaller than 0.236307. For n = 100, it is less than 0.027145. For n = 1000, it is smaller than
2.17221 · 10^{−16} (which is pretty slim and shady). Namely, as the number of games increases, the distribution concentrates around
its expectation, and this convergence is exponential.
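The numbers quoted in the example are easy to reproduce; the short Python check below (not from the text) evaluates the bound F⁺(n/3, 1/2) = (e^{1/2}/1.5^{1.5})^{n/3} for the three values of n.

```python
import math

def winning_season_bound(n):
    """The Chernoff bound F+(n/3, 1/2) from Example 25.2.8: mu = n/3 expected
    wins and delta = 1/2, since a winning season means more than (1+1/2)*mu wins."""
    base = math.exp(0.5) / 1.5 ** 1.5          # ~ 0.89745
    return base ** (n / 3.0)

for n in (40, 100, 1000):
    print(n, winning_season_bound(n))
# Prints roughly 0.236, 0.0271 and 2.17e-16, matching the values quoted above.
```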
Values   Probabilities                                Inequality                                    Ref
−1, +1   Pr[X_i = −1] = Pr[X_i = 1] = 1/2             Pr[Y ≥ ∆] ≤ e^{−∆²/2n}                        Theorem 25.2.1
                                                      Pr[Y ≤ −∆] ≤ e^{−∆²/2n}                       Theorem 25.2.1
                                                      Pr[|Y| ≥ ∆] ≤ 2e^{−∆²/2n}                     Corollary 25.2.2
0, 1     Pr[X_i = 0] = Pr[X_i = 1] = 1/2              Pr[|Y − n/2| ≥ ∆] ≤ 2e^{−2∆²/n}               Corollary 25.2.3
0, 1     Pr[X_i = 0] = 1 − p_i, Pr[X_i = 1] = p_i     Pr[Y > (1 + δ)µ] < ( e^δ/(1 + δ)^{1+δ} )^µ    Theorem 25.2.6
         For δ ≤ 2e − 1                               Pr[Y > (1 + δ)µ] < exp(−µδ²/4)                Theorem 25.2.6
         For δ ≥ 2e − 1                               Pr[Y > (1 + δ)µ] < 2^{−µ(1+δ)}
         For δ ≥ 0                                    Pr[Y < (1 − δ)µ] < exp(−µδ²/2)                Theorem 25.2.9

Table 25.1: Summary of the Chernoff-type inequalities covered. Here we have n variables X_1, ..., X_n, Y = Σ_i X_i
and µ = E[Y].
Definition 25.2.10 F⁻(µ, δ) = e^{−µδ²/2}.
Let ∆⁻(µ, ε) denote the value of δ for which this probability becomes smaller than ε; that is,
∆⁻(µ, ε) = √( 2 log(1/ε) / µ ).
For large δ:
∆⁺(µ, ε) < log₂(1/ε)/µ − 1,
for δ ≤ 1/2.
25.4 Exercises
Exercise 25.4.1 (Simpler Tail Inequality.) [1 Points]
[2 Points] Prove that for δ > 2e − 1, we have
F⁺(µ, δ) < ( e/(1 + δ) )^{(1+δ)µ} ≤ 2^{−(1+δ)µ}.
Bibliography
[AACS98] P. K. Agarwal, B. Aronov, T. M. Chan, and M. Sharir. On levels in arrangements of lines, segments, planes, and
triangles. Discrete Comput. Geom., 19:315–331, 1998.
[AB99] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge, 1999.
[AC07] P. Afshani and T. M. Chan. On approximate range counting and depth. In Proc. 23rd Annu. ACM Sympos. Comput.
Geom., pages 337–343, 2007.
[Ach01] D. Achlioptas. Database-friendly random projections. In Proc. 20th ACM Sympos. Principles Database Syst., pages
274–281, 2001.
[ACNS82] M. Ajtai, V. Chvátal, M. Newborn, and E. Szemerédi. Crossing-free subgraphs. Ann. Discrete Math., 12:9–12, 1982.
[AD97] P. K. Agarwal and P. K. Desikan. An efficient algorithm for terrain simplification. In Proc. 8th ACM-SIAM Sympos.
Discrete Algorithms, pages 139–147, 1997.
[AEIS99] A. Amir, A. Efrat, P. Indyk, and H. Samet. Efficient algorithms and regular data structures for dilation, location and
proximity problems. In Proc. 40th Annu. IEEE Sympos. Found. Comput. Sci., pages 160–170, 1999.
[AG86] N. Alon and E. Győri. The number of small semispaces of a finite set of points in the plane. J. Combin. Theory Ser.
A, 41:154–157, 1986.
[AGK+ 01] V. Arya, N. Garg, R. Khandekar, K. Munagala, and V. Pandit. Local search heuristic for k-median and facility
location problems. In Proc. 33rd Annu. ACM Sympos. Theory Comput., pages 21–29, 2001.
[AH05] B. Aronov and S. Har-Peled. On approximating the depth and related problems. In Proc. 16th ACM-SIAM Sympos.
Discrete Algorithms, pages 886–894, 2005.
[AHK06] S. Arora, E. Hazan, and S. Kale. Multiplicative weights method: a meta-algorithm and its applications. manuscript.
Available from , 2006.
[AHS07] B. Aronov, S. Har-Peled, and M. Sharir. On approximate halfspace range counting and relative epsilon-
approximations. In Proc. 23rd Annu. ACM Sympos. Comput. Geom., pages 327–336, 2007.
[AHV04] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Approximating extent measures of points. J. Assoc. Comput.
Mach., 51(4):606–635, 2004.
[AHY07] P. Agarwal, S. Har-Peled, and H. Yu. Embeddings of surfaces, curves, and moving points in euclidean space. In
Proc. 23rd Annu. ACM Sympos. Comput. Geom., pages 381–389, 2007.
[AKPW95] N. Alon, R. M. Karp, D. Peleg, and D. West. A graph-theoretic game and its application to the k-server problem.
SIAM J. Comput., 24(1):78–100, February 1995.
[AM94] P. K. Agarwal and J. Matoušek. On range searching with semialgebraic sets. Discrete Comput. Geom., 11:393–418,
1994.
[AM98] S. Arya and D. Mount. ANN: library for approximate nearest neighbor searching. http://www.cs.umd.edu/
~mount/ANN/, 1998.
[AM02] S. Arya and T. Malamatos. Linear-size approximate Voronoi diagrams. In Proc. 13th ACM-SIAM Sympos. Discrete
Algorithms, pages 147–155, 2002.
[AM04] S. Arya and D. M. Mount. Computational geometry: Proximity and location. In D. Mehta and S. Sahni, editors,
Handbook of Data Structures and Applications, chapter 63. CRC Press LLC, Boca Raton, FL, 2004. to appear.
[Ame94] N. Amenta. Helly-type theorems and generalized linear programming. Discrete Comput. Geom., 12:241–261, 1994.
[AMM02] S. Arya, T. Malamatos, and D. M. Mount. Space-efficient approximate Voronoi diagrams. In Proc. 34th Annu. ACM
Sympos. Theory Comput., pages 721–730, 2002.
[AMN+ 98] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest
neighbor searching in fixed dimensions. J. Assoc. Comput. Mach., 45(6), 1998.
[AMS94] P. K. Agarwal, J. Matoušek, and O. Schwarzkopf. Computing many faces in arrangements of lines and segments. In
Proc. 10th Annu. ACM Sympos. Comput. Geom., pages 76–84, 1994.
[APV02] P. K. Agarwal, C. M. Procopiuc, and K. R. Varadarajan. Approximation algorithms for k-line center. In Proc. 10th
Annu. European Sympos. Algorithms, pages 54–63, 2002.
[Aro98] S. Arora. Polynomial time approximation schemes for euclidean TSP and other geometric problems. J. Assoc.
Comput. Mach., 45(5):753–782, Sep 1998.
[AS00] N. Alon and J. H. Spencer. The probabilistic method. Wiley Inter-Science, 2nd edition, 2000.
[Aur91] F. Aurenhammer. Voronoi diagrams: A survey of a fundamental geometric data structure. ACM Comput. Surv.,
23:345–405, 1991.
[Bal97] K. Ball. An elementary introduction to modern convex geometry. In Flavors of geometry, volume MSRI Publ. 31.
Cambridge Univ. Press, 1997. http://www.msri.org/publications/books/Book31/files/ball.pdf.
[Bar96] Y. Bartal. Probabilistic approximations of metric space and its algorithmic application. In Proc. 37th Annu. IEEE
Sympos. Found. Comput. Sci., pages 183–193, October 1996.
[Bar98] Y. Bartal. On approximating arbitrary metrices by tree metrics. In Proc. 30th Annu. ACM Sympos. Theory Comput.,
pages 161–168, 1998.
[Bar02] A. Barvinok. A course in convexity, volume 54 of Graduate Studies in Mathematics. American Mathematical
Society, Providence, RI, 2002.
[BEG94] M. Bern, D. Eppstein, and J. Gilbert. Provably good mesh generation. J. Comput. Syst. Sci., 48:384–409, 1994.
[BH01] G. Barequet and S. Har-Peled. Efficiently approximating the minimum-volume bounding box of a point set in three
dimensions. J. Algorithms, 38:91–109, 2001.
[BHI02] M. Bădoiu, S. Har-Peled, and P. Indyk. Approximate clustering via coresets. In Proc. 34th Annu. ACM Sympos.
Theory Comput., pages 250–257, 2002.
[BM58] G. E.P. Box and M. E. Muller. A note on the generation of random normal deviates. Annl. Math. Stat., 28:610–611,
1958.
[BMP05] P. Brass, W. Moser, and J. Pach. Research Problems in Discrete Geometry. Springer, 2005.
[BVZ01] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern
Anal. Mach. Intell., 23(11):1222–1239, 2001.
[BY98] J.-D. Boissonnat and M. Yvinec. Algorithmic Geometry. Cambridge University Press, UK, 1998. Translated by H.
Brönnimann.
[Cal95] P. B. Callahan. Dealing with higher dimensions: the well-separated pair decomposition and its applications. Ph.D.
thesis, Dept. Comput. Sci., Johns Hopkins University, Baltimore, Maryland, 1995.
[Car76] L. Carroll. The hunting of the snark, 1876.
[CF90] B. Chazelle and J. Friedman. A deterministic view of random sampling and its use in geometry. Combinatorica,
10(3):229–249, 1990.
[Cha96] T. M. Chan. Fixed-dimensional linear programming queries made easy. In Proc. 12th Annu. ACM Sympos. Comput.
Geom., pages 284–290, 1996.
[Cha98] T. M. Chan. Approximate nearest neighbor queries revisited. Discrete Comput. Geom., 20:359–373, 1998.
[Cha01] B. Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, New York, 2001.
[Cha02] T. M. Chan. Closest-point problems simplified on the ram. In Proc. 13th ACM-SIAM Sympos. Discrete Algorithms,
pages 472–473. Society for Industrial and Applied Mathematics, 2002.
[Cha05] T. M. Chan. Low-dimensional linear programming with violations. SIAM J. Comput., pages 879–893, 2005.
[Cha06] T. M. Chan. Faster core-set constructions and data-stream algorithms in fixed dimensions. Comput. Geom. Theory
Appl., 35(1-2):20–35, 2006.
[Che86] L. P. Chew. Building Voronoi diagrams for convex polygons in linear expected time. Technical Report PCS-TR90-
147, Dept. Math. Comput. Sci., Dartmouth College, Hanover, NH, 1986.
[Che06] K. Chen. On k-median clustering in high dimensions. In Proc. 17th ACM-SIAM Sympos. Discrete Algorithms, pages
1177–1185, 2006.
[Che07] K. Chen. A constant factor approximation algorithm for k-median with outliers. manuscript, 2007.
[CK95] P. B. Callahan and S. R. Kosaraju. A decomposition of multidimensional point sets with applications to k-nearest-
neighbors and n-body potential fields. J. Assoc. Comput. Mach., 42:67–90, 1995.
[CKMN01] M. Charikar, S. Khuller, D. M. Mount, and G. Narasimhan. Algorithms for facility location problems with outliers.
In Proc. 12th ACM-SIAM Sympos. Discrete Algorithms, pages 642–651, 2001.
[CKR01] G. Calinescu, H. Karloff, and Y. Rabani. Approximation algorithms for the 0-extension problem. In Proceedings of
the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 8–16. Society for Industrial and Applied
Mathematics, 2001.
[Cla83] K. L. Clarkson. Fast algorithms for the all nearest neighbors problem. In Proc. 24th Annu. IEEE Sympos. Found.
Comput. Sci., pages 226–232, 1983.
[Cla87] K. L. Clarkson. New applications of random sampling in computational geometry. Discrete Comput. Geom., 2:195–
222, 1987.
[Cla88] K. L. Clarkson. Applications of random sampling in computational geometry, II. In Proc. 4th Annu. ACM Sympos.
Comput. Geom., pages 1–11, 1988.
[Cla93] K. L. Clarkson. Algorithms for polytope covering and approximation. In Proc. 3th Workshop Algorithms Data
Struct., volume 709 of Lect. Notes in Comp. Sci., pages 246–252. Springer-Verlag, 1993.
[Cla94] K. L. Clarkson. An algorithm for approximate closest-point queries. In Proc. 10th Annu. ACM Sympos. Comput.
Geom., pages 160–164, 1994.
[Cla95] K. L. Clarkson. Las Vegas algorithms for linear and integer programming. J. Assoc. Comput. Mach., 42:488–499,
1995.
[CLRS01] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press / McGraw-Hill,
Cambridge, Mass., 2001.
[CM96] B. Chazelle and J. Matoušek. On linear-time deterministic algorithms for optimization problems in fixed dimension.
J. Algorithms, 21:579–597, 1996.
[CMS93] K. L. Clarkson, K. Mehlhorn, and R. Seidel. Four results on randomized incremental constructions. Comput. Geom.
Theory Appl., 3(4):185–212, 1993.
[CS89] K. L. Clarkson and P. W. Shor. Applications of random sampling in computational geometry, II. Discrete Comput.
Geom., 4:387–421, 1989.
[CS00] N. Cristianini and J. Shaw-Taylor. Support Vector Machines. Cambridge Press, 2000.
[CW89] B. Chazelle and E. Welzl. Quasi-optimal range searching in spaces of finite VC-dimension. Discrete Comput. Geom.,
4:467–489, 1989.
[dBDS95] M. de Berg, K. Dobrindt, and O. Schwarzkopf. On lazy randomized incremental construction. Discrete Comput.
Geom., 14:261–286, 1995.
[dBS95] M. de Berg and O. Schwarzkopf. Cuttings and applications. Internat. J. Comput. Geom. Appl., 5:343–355, 1995.
[dBvKOS00] M. de Berg, M. van Kreveld, M. H. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and
Applications. Springer-Verlag, 2nd edition, 2000.
[Dey98] T. K. Dey. Improved bounds for planar k-sets and related problems. Discrete Comput. Geom., 19(3):373–382, 1998.
[DG03] S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Rand. Struct. Alg.,
22(3):60–65, 2003.
[DK85] D. P. Dobkin and D. G. Kirkpatrick. A linear algorithm for determining the separation of convex polyhedra. J.
Algorithms, 6:381–392, 1985.
[DNIM04] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distribu-
tions. In Proc. 20th Annu. ACM Sympos. Comput. Geom., pages 253–262, 2004.
[Dud74] R. M. Dudley. Metric entropy of some classes of sets with differentiable boundaries. J. Approx. Theory, 10(3):227–
236, 1974.
[Dun99] C. A. Duncan. Balanced Aspect Ratio Trees. Ph.D. thesis, Department of Computer Science, Johns Hopkins Uni-
versity, Baltimore, Maryland, 1999.
[Dur95] R. Durrett. Probability: Theory and Examples. Duxbury Press, August 1995.
[EGS05] D. Eppstein, M. T. Goodrich, and J. Z. Sun. The skip quadtree: a simple dynamic data structure for multidimensional
data. In Proc. 21st Annu. ACM Sympos. Comput. Geom., pages 296–305. ACM, June 2005.
[EK89] O. Egecioglu and B. Kalantari. Approximating the diameter of a set of points in the Euclidean space. Inform.
Process. Lett., 32:205–211, 1989.
[Ele97] G. Elekes. On the number of sums and products. ACTA Arithmetica, pages 365–367, 1997.
[ERvK96] H. Everett, J.-M. Robert, and M. van Kreveld. An optimal algorithm for the (≤k)-levels, with applications to separa-
tion and transversal problems. Internat. J. Comput. Geom. Appl., 6:247–261, 1996.
[Fel71] W. Feller. An Introduction to Probability Theory and its Applications, volume II. John Wiley & Sons, NY, 1971.
[Fel91] W. Feller. An Introduction to Probability Theory and its Applications. John Wiley & Sons, NY, 1991.
[FG88] T. Feder and D. H. Greene. Optimal algorithms for approximate clustering. In Proc. 20th Annu. ACM Sympos.
Theory Comput., pages 434–444, 1988.
[FGK+ 00] A. Fabri, G.-J. Giezeman, L. Kettner, S. Schirra, and S. Schönherr. On the design of CGAL a computational geometry
algorithms library. Softw. – Pract. Exp., 30(11):1167–1202, 2000.
[FH05] J. Fischer and S. Har-Peled. Dynamic well-separated pair decomposition made easy. In CCCG, pages 235–238,
2005.
[FH06] J. Fischer and S. Har-Peled. On coresets for clustering and related problems. manuscript, 2006.
[For97] S. Fortune. Voronoi diagrams and Delaunay triangulations. In J. E. Goodman and J. O’Rourke, editors, Handbook
of Discrete and Computational Geometry, chapter 20. CRC Press LLC, Boca Raton, FL, 1997.
[FRT03] J. Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbitrary metrics by tree metrics. In
Proc. 35th Annu. ACM Sympos. Theory Comput., pages 448–455, 2003.
[FS97] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.
J. Comput. Syst. Sci., 55(1):119–139, 1997.
[Gar82] I. Gargantini. An effective way to represent quadtrees. Commun. ACM, 25(12):905–910, 1982.
[Gar02] R. J. Gardner. The Brunn-Minkowski inequality. Bull. Amer. Math. Soc., 39:355–405, 2002.
[GK92] P. Gritzmann and V. Klee. Inner and outer j-radii of convex bodies in finite-dimensional normed spaces. Discrete
Comput. Geom., 7:255–280, 1992.
[GLS88] M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Optimization, volume 2 of
Algorithms and Combinatorics. Springer-Verlag, Berlin Heidelberg, 2nd edition, 1988. 2nd edition 1994.
[Gol95] M. Goldwasser. A survey of linear programming in randomized subexponential time. SIGACT News, 26(2):96–104,
1995.
[Gon85] T. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoret. Comput. Sci., 38:293–306, 1985.
[GP84] J. E. Goodman and R. Pollack. On the number of k-subsets of a set of n points in the plane. J. Combin. Theory Ser.
A, 36:101–104, 1984.
[GRSS95] M. Golin, R. Raman, C. Schwarz, and M. Smid. Simple randomized algorithms for closest pair problems. Nordic J.
Comput., 2:3–27, 1995.
[Grü03] B. Grünbaum. Convex Polytopes. Springer, 2nd edition, May 2003. Prepared by Volker Kaibel, Victor Klee, and
Günter Ziegler.
[GT00] A. Gupta and E. Tardos. A constant factor approximation algorithm for a class of classification problems. In Proc.
32nd Annu. ACM Sympos. Theory Comput., pages 652–658, 2000.
[Gup00] A. Gupta. Embeddings of Finite Metrics. PhD thesis, University of California, Berkeley, 2000.
[Har00a] S. Har-Peled. Constructing planar cuttings in theory and practice. SIAM J. Comput., 29(6):2016–2039, 2000.
[Har00b] S. Har-Peled. Taking a walk in a planar arrangement. SIAM J. Comput., 30(4):1341–1367, 2000.
[Har01a] S. Har-Peled. A practical approach for computing the diameter of a point-set. In Proc. 17th Annu. ACM Sympos.
Comput. Geom., pages 177–186, 2001.
[Har01b] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. In Proc. 42nd Annu. IEEE Sympos. Found.
Comput. Sci., pages 94–103, 2001.
[HI00] S. Har-Peled and P. Indyk. When crossings count - approximating the minimum spanning tree. In Proc. 16th Annu.
ACM Sympos. Comput. Geom., pages 166–175, 2000.
[HM03] S. Har-Peled and S. Mazumdar. Fast algorithms for computing the smallest k-enclosing disc. In Proc. 11th Annu.
European Sympos. Algorithms, volume 2832 of Lect. Notes in Comp. Sci., pages 278–288. Springer-Verlag, 2003.
[HM04] S. Har-Peled and S. Mazumdar. Coresets for k-means and k-median clustering and their applications. In Proc. 36th
Annu. ACM Sympos. Theory Comput., pages 291–300, 2004.
[HM06] S. Har-Peled and M. Mendel. Fast construction of nets in low dimensional metrics, and their applications. SIAM J.
Comput., 35(5):1148–1184, 2006.
[HS06] S. Har-Peled and M. Sharir. Relative ε-approximations in geometry. Manuscript. Available from http://www.
uiuc.edu/~sariel/papers/06/integrate, 2006.
[HÜ05] S. Har-Peled and A. Üngör. A time-optimal delaunay refinement algorithm in two dimensions. In Proc. 21st Annu.
ACM Sympos. Comput. Geom., pages 228–236, 2005.
[HW87] D. Haussler and E. Welzl. ε-nets and simplex range queries. Discrete Comput. Geom., 2:127–151, 1987.
[IM98] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc.
30th Annu. ACM Sympos. Theory Comput., pages 604–613, 1998.
[Ind99] P. Indyk. Sublinear time algorithms for metric space problems. In Proc. 31st Annu. ACM Sympos. Theory Comput.,
pages 154–159, 1999.
[Ind01] P. Indyk. Algorithmic applications of low-distortion geometric embeddings. In Proc. 42nd Annu. IEEE Sympos.
Found. Comput. Sci., pages 10–31, 2001. Tutorial.
[Ind04] P. Indyk. Nearest neighbors in high-dimensional spaces. In J. E. Goodman and J. O’Rourke, editors, Handbook of
Discrete and Computational Geometry, chapter 39, pages 877–892. CRC Press LLC, Boca Raton, FL, 2nd edition,
2004.
[Joh48] F. John. Extremum problems with inequalities as subsidary conditions. Courant Anniversary, pages 187–204, 1948.
[Kal92] G. Kalai. A subexponential randomized simplex algorithm. In Proc. 24th Annu. ACM Sympos. Theory Comput.,
pages 475–482, 1992.
[KF93] I. Kamel and C. Faloutsos. On packing R-trees. In Proc. 2nd Intl. Conf. Info. Knowl. Mang., pages 490–499, 1993.
[Kle02] J. Kleinberg. An impossibility theorem for clustering. In Neural Info. Proc. Sys., 2002.
[KLMN04] R. Krauthgamer, J. R. Lee, M. Mendel, and A. Naor. Measured descent: A new embedding method for finite metric
spaces. In Proc. 45th Annu. IEEE Sympos. Found. Comput. Sci., page to appear, 2004.
[KMN+ 04] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. A local search approximation
algorithm for k-means clustering. Comput. Geom. Theory Appl., 28:89–112, 2004.
[KOR00] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional
spaces. SIAM J. Comput., 2(30):457–474, 2000.
[KS06] H. Kaplan and M. Sharir. Randomized incremental constructions of three-dimensional convex hulls and planar
voronoi diagrams, and approximate range counting. In Proc. 17th ACM-SIAM Sympos. Discrete Algorithms, pages
484–493, 2006.
[KT06] J. Kleinberg and E. Tardos. Algorithm design. Addison-Wesley, 2006.
[Lei84] F. T. Leighton. New lower bound techniques for VLSI. Math. Syst. Theory, 17:47–70, 1984.
[Leo98] S. J. Leon. Linear Algebra with Applications. Prentice Hall, 5th edition, 1998.
[Lit88] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Mach. Learn.,
2(4):285–318, 1988.
[LLS01] Y. Li, P. M. Long, and A. Srinivasan. Improved bounds on the sample complexity of learning. J. Comput. Syst. Sci.,
62(3):516–527, 2001.
[Mac50] A.M. Macbeath. A compactness theorem for affine equivalence-classes of convex regions. Canad. J. Math, 3:54–61,
1950.
[Mag02] A. Magen. Dimensionality reductions that preserve volumes and distance to affine spaces, and their algorithmic
applications. In The 6th Intl. Work. Rand. Appr. Tech. Comp. Sci., pages 239–253, 2002.
[Mat90] J. Matoušek. Bi-lipschitz embeddings into low-dimensional euclidean spaces. Comment. Math. Univ. Carolinae,
31:589–600, 1990.
[Mat92] J. Matoušek. Efficient partition trees. Discrete Comput. Geom., 8:315–334, 1992.
[Mat95a] J. Matoušek. On enclosing k points by a circle. Inform. Process. Lett., 53:217–221, 1995.
[Mat95b] J. Matoušek. On geometric optimization with few violated constraints. Discrete Comput. Geom., 14:365–384, 1995.
[Mat98] J. Matoušek. On constants for cuttings in the plane. Discrete Comput. Geom., 20:427–448, 1998.
[Mat99] J. Matoušek. Geometric Discrepancy. Springer, 1999.
[Mat02] J. Matoušek. Lectures on Discrete Geometry. Springer, 2002.
[Meg83] N. Megiddo. Linear-time algorithms for linear programming in R3 and related problems. SIAM J. Comput.,
12(4):759–776, 1983.
[Meg84] N. Megiddo. Linear programming in linear time when the dimension is fixed. J. Assoc. Comput. Mach., 31:114–127,
1984.
[Mil04] G. L. Miller. A time efficient Delaunay refinement algorithm. In Proc. 15th ACM-SIAM Sympos. Discrete Algorithms,
pages 400–409, 2004.
[MN98] J. Matoušek and J. Nešetřil. Invitation to Discrete Mathematics. Oxford Univ Pr, 1998.
[MP03] R. R. Mettu and C. G. Plaxton. The online median problem. SIAM J. Comput., 32(3):816–832, 2003.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, New York, NY, 1995.
[MSW96] J. Matoušek, M. Sharir, and E. Welzl. A subexponential bound for linear programming. Algorithmica, 16:498–516,
1996.
[Mul94a] K. Mulmuley. Computational Geometry: An Introduction Through Randomized Algorithms. Prentice Hall, Engle-
wood Cliffs, NJ, 1994.
[Mul94b] K. Mulmuley. An efficient algorithm for hidden surface removal, II. J. Comp. Sys. Sci., 49:427–453, 1994.
[O’R85] J. O’Rourke. Finding minimal enclosing boxes. Internat. J. Comput. Inform. Sci., 14:183–199, 1985.
[OvL81] M. H. Overmars and J. van Leeuwen. Maintenance of configurations in the plane. J. Comput. Syst. Sci., 23:166–204,
1981.
[Rab76] M. O. Rabin. Probabilistic algorithms. In J. F. Traub, editor, Algorithms and Complexity: New Directions and Recent
Results, pages 21–39. Academic Press, New York, NY, 1976.
[Rup93] J. Ruppert. A new and simple algorithm for quality 2-dimensional mesh generation. In Proc. 4th ACM-SIAM Sympos.
Discrete Algorithms, pages 83–92, 1993.
[SA95] M. Sharir and P. K. Agarwal. Davenport-Schinzel Sequences and Their Geometric Applications. Cambridge Uni-
versity Press, New York, 1995.
[Sag94] H. Sagan. Space-Filling Curves. Springer-Verlag, 1994.
[Sam89] H. Samet. Spatial Data Structures: Quadtrees, Octrees, and Other Hierarchical Methods. Addison-Wesley, Reading,
MA, 1989.
[Sei91] R. Seidel. Small-dimensional linear programming and convex hulls made easy. Discrete Comput. Geom., 6:423–434,
1991.
[Sei93] R. Seidel. Backwards analysis of randomized geometric algorithms. In J. Pach, editor, New Trends in Discrete and
Computational Geometry, volume 10 of Algorithms and Combinatorics, pages 37–68. Springer-Verlag, 1993.
[Sha03] M. Sharir. The Clarkson-Shor technique revisited and extended. Comb., Prob. & Comput., 12(2):191–201, 2003.
[Smi00] M. Smid. Closest-point problems in computational geometry. In Jörg-Rüdiger Sack and Jorge Urrutia, editors,
Handbook of Computational Geometry, pages 877–935. Elsevier Science Publishers B. V. North-Holland,
Amsterdam, 2000.
[SSS02] Y. Sabharwal, N. Sharma, and S. Sen. Improved reductions of nearest neighbors search to PLEBs with applications to
linear-sized approximate Voronoi decompositions. In Proc. 22nd Conf. Found. Soft. Tech. Theoret. Comput. Sci., pages
311–323, 2002.
[Sto91] J. Stolfi. Oriented Projective Geometry: A Framework for Geometric Computations. Academic Press, New York,
NY, 1991.
[SW92] M. Sharir and E. Welzl. A combinatorial bound for linear programming and related problems. In Proc. 9th Sympos.
Theoret. Aspects Comput. Sci., volume 577 of Lect. Notes in Comp. Sci., pages 569–579. Springer-Verlag, 1992.
[Szé97] L. A. Székely. Crossing numbers and hard Erdős problems in discrete geometry. Combinatorics, Probability and
Computing, 6:353–358, 1997.
[Tót01] G. Tóth. Point sets with many k-sets. Discrete Comput. Geom., 26(2):187–194, 2001.
[Tou83] G. T. Toussaint. Solving geometric problems with the rotating calipers. In Proc. IEEE MELECON ’83, pages
A10.02/1–4, 1983.
[Üng04] A. Üngör. Off-centers: A new type of Steiner points for computing size-optimal quality-guaranteed Delaunay
triangulations. In Latin Amer. Theo. Inf. Symp., pages 152–161, 2004.
[Vai86] P. M. Vaidya. An optimal algorithm for the all-nearest-neighbors problem. In Proc. 27th Annu. IEEE Sympos. Found.
Comput. Sci., pages 117–122, 1986.
[Van97] R. J. Vanderbei. Linear Programming: Foundations and Extensions. Kluwer, 1997.
[VC71] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their
probabilities. Theory Probab. Appl., 16:264–280, 1971.
[Wel86] E. Welzl. More on k-sets of finite sets in the plane. Discrete Comput. Geom., 1:95–100, 1986.
[Wel92] E. Welzl. On spanning trees with low crossing numbers. In Data Structures and Efficient Algorithms, Final Report
on the DFG Special Joint Initiative, volume 594 of Lect. Notes in Comp. Sci., pages 233–249. Springer-Verlag, 1992.
[WVTP97] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner. Scalable high speed IP routing lookups. In Proc. ACM
SIGCOMM '97, October 1997.
[YAPV04] H. Yu, P. K. Agarwal, R. Poreddy, and K. R. Varadarajan. Practical methods for shape fitting and kinetic data
structures using core sets. In Proc. 20th Annu. ACM Sympos. Comput. Geom., pages 263–272, 2004.
Index
edge ratio, 32
embedding, 189
excess, 18, 20
expansive mapping, 190
exponential distribution, 139
extended cluster, 32
extent, 169, 184

face, 110
facility location, 60
fair split tree, 48
feasible solution, 97
finger tree, 26

gamma distribution, 139
Gradation, 27
gradation, 17, 27
greedy permutation, 53
grid, 13
    cluster, 13
ground set, 63

heavy, 19

incidence
    line-point, 87
isoperimetric inequality, 135

killing set, 80, 88

lazy randomized incremental construction, 83
level, 24, 85, 116
line
    support, 183
linear program
    vertex, 97
Linear programming, 97
linear programming
    unbounded, 97
linearization, 171
Lipschitz, 137
local search, 54, 60
    k-median clustering, 55
lower convex chain, 185
lower envelope, 169, 184

median, 137
metric, 51, 189
metric space, 51, 189
metric spaces
    low doubling dimension, 49
Metropolis algorithm, 60
Minkowski sum, 133
moments technique, 82
    all regions, 80, 88
monotone ε-shell set, 177

nearest neighbor, 45
net, 54
normal distribution, 139

order
    Q, 29
    z, 29
outliers, 60

passes, 108
Peano curve, 37
planar, 86
point above hyperplane, 185
point below hyperplane, 185
Poisson distribution, 139
polyhedron, 97
polytope, 110
probabilistic distortion, 192
Problem
    Dominating Set, 59
    Satisfiability, 60
    Set Cover, 153, 156, 159
    Set Covering, 179
    Traveling Salesperson, 60
    uniqueness, 15
    Vertex Cover, 60

quadtree
    balanced, 32
    compressed, 25
    linear, 35

radius, 15
Random sampling
    Weighted Sets, 178
Randomized Incremental Construction, 77
range, 63
range space, 63
    projection, 63
region, 25
RIC, 77
ring tree, 119

separated
    sets, 39
Separation property, 54
separator, tree, 26
shatter function, 65
shattered, 63
shattering dimension, 65
simplex, 98, 113
simulated annealing, 60
sink, 113
skip-quadtree, 27
spanner, 43
sponginess, 50
spread, 24, 193
    squared, 50
stretch, 43
successful, 157

target function, 97
Theorem 14.1.2, 134
upper convex chain, 185
upper envelope, 169, 184

VC dimension, 63
vertex, 110
vertex figure, 111
vertical decomposition, 77
    vertex, 77
visibility polygon, 157
Voronoi
    partition, 52
Voronoi diagram, 186

weight, 85
    region, 80, 88
well-balanced, 32
well-separated pairs decomposition, 39
width, 13
WSPD, 39, 40
    generator, 46