Kubernetes in Action (2018) PDF
MANNING
Kubernetes resources covered in the book

Deploying workloads
  Pod (po) [v1]                        The basic deployable unit containing one or more processes in co-located containers (section 3.1)
  ReplicaSet (rs) [apps/v1beta2**]     Keeps one or more pod replicas running (4.3)
  DaemonSet (ds) [apps/v1beta2**]      Runs one pod replica per node (on all nodes or only on those matching a node selector) (4.4)
  StatefulSet (sts) [apps/v1beta1**]   Runs stateful pods with a stable identity (10.2)

Services
  Service (svc) [v1]                   Exposes one or more pods at a single and stable IP address and port pair (5.1)
  Endpoints (ep) [v1]                  Defines which pods (or other servers) are exposed through a service (5.2.1)
  Ingress (ing) [extensions/v1beta1]   Exposes one or more services to external clients through a single externally reachable IP address (5.4)

Config
  ConfigMap (cm) [v1]                  A key-value map for storing non-sensitive config (7.4)

Storage
  PersistentVolume* (pv) [v1]          Points to persistent storage that can be mounted into a pod through a PersistentVolumeClaim (6.5)
MANNING
SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: [email protected]
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.
ISBN: 9781617293726
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – EBM – 22 21 20 19 18 17
To my parents,
who have always put their children’s needs above their own
brief contents
PART 1 OVERVIEW
1 ■ Introducing Kubernetes 1
2 ■ First steps with Docker and Kubernetes 25
contents

PART 1 OVERVIEW

1 Introducing Kubernetes 1
    1.1 Understanding the need for a system like Kubernetes 2
        Moving from monolithic apps to microservices 3 ■ Providing a ...

2 First steps with Docker and Kubernetes 25
    ... for kubectl 41
    2.3 Running your first app on Kubernetes 42
        Deploying your Node.js app 42 ■ Accessing your web ... ■ Kubernetes dashboard 52
    2.4 Summary 53

    ... ReplicationController 103
    4.3 Using ReplicaSets instead of ReplicationControllers 104
        Comparing a ReplicaSet to a ReplicationController 105 ■ Defining a ReplicaSet 105 ■ Creating and examining a ...
    ... complete 116
    4.6 Scheduling Jobs to run periodically or once in the future 116
        Creating a CronJob 116 ■ Understanding how scheduled jobs are run 117
    4.7 Summary 118

    ... service 134
    5.3 Exposing services to external clients 134
        Using a NodePort service 135 ■ Exposing a service through an external load balancer 138 ■ Understanding the peculiarities of external connections 141
    5.4 Exposing services externally through an Ingress resource 142
        Creating an Ingress resource 144 ■ Accessing the service ...

    ... PersistentVolumes 183
    6.6 Dynamic provisioning of PersistentVolumes 184
        Defining the available storage types through StorageClass resources 185 ■ Requesting the storage class in a ...

    ... guarantees 289
    10.3 Using a StatefulSet 290
        Creating the app and container image 290 ■ Deploying the app ...

    ... to a pod 351
    12.2 Securing the cluster with role-based access control 353
        Introducing the RBAC authorization plugin 353 ■ Introducing ...

    ... for persistent storage 427 ■ Limiting the number of objects that can ...
    ... scale-down 454
    15.4 Summary 456

preface
absolutely the best way to get to know it in much greater detail than you’d learn as just
a user. As my knowledge of Kubernetes has expanded during the process and Kuber-
netes itself has evolved, I’ve constantly gone back to previous chapters I’ve written and
added additional information. I’m a perfectionist, so I’ll never really be absolutely sat-
isfied with the book, but I’m happy to hear that a lot of readers of the Manning Early
Access Program (MEAP) have found it to be a great guide to Kubernetes.
My aim is to get the reader to understand the technology itself and teach them
how to use the tooling to effectively and efficiently develop and deploy apps to Kuber-
netes clusters. In the book, I don’t put much emphasis on how to actually set up and
maintain a proper highly available Kubernetes cluster, but the last part should give
readers a very solid understanding of what such a cluster consists of and should allow
them to easily comprehend additional resources that deal with this subject.
I hope you’ll enjoy reading it, and that it teaches you how to get the most out of
the awesome system that is Kubernetes.
acknowledgments
Before I started writing this book, I had no clue how many people would be involved
in bringing it from a rough manuscript to a published piece of work. This means
there are a lot of people to thank.
First, I’d like to thank Erin Twohey for approaching me about writing this book,
and Michael Stephens from Manning, who had full confidence in my ability to write it
from day one. His words of encouragement early on really motivated me and kept me
motivated throughout the last year and a half.
I would also like to thank my initial development editor Andrew Warren, who
helped me get my first chapter out the door, and Elesha Hyde, who took over from
Andrew and worked with me all the way to the last chapter. Thank you for bearing
with me, even though I’m a difficult person to deal with, as I tend to drop off the
radar fairly regularly.
I would also like to thank Jeanne Boyarsky, who was the first reviewer to read and
comment on my chapters while I was writing them. Jeanne and Elesha were instrumen-
tal in making the book as nice as it hopefully is. Without their comments, the book
could never have received such good reviews from external reviewers and readers.
I’d like to thank my technical proofreader, Antonio Magnaghi, and of course all
my external reviewers: Al Krinker, Alessandro Campeis, Alexander Myltsev, Csaba Sari,
David DiMaria, Elias Rangel, Erisk Zelenka, Fabrizio Cucci, Jared Duncan, Keith
Donaldson, Michael Bright, Paolo Antinori, Peter Perlepes, and Tiklu Ganguly. Their
positive comments kept me going at times when I worried my writing was utterly awful
and completely useless. On the other hand, their constructive criticism helped improve
xxiii
xxiv ACKNOWLEDGMENTS
sections that I’d quickly thrown together without enough effort. Thank you for point-
ing out the hard-to-understand sections and suggesting ways of improving the book.
Also, thank you for asking the right questions, which made me realize I was wrong
about two or three things in the initial versions of the manuscript.
I also need to thank readers who bought the early version of the book through
Manning’s MEAP program and voiced their comments in the online forum or reached
out to me directly—especially Vimal Kansal, Paolo Patierno, and Roland Huß, who
noticed quite a few inconsistencies and other mistakes. And I would like to thank
everyone at Manning who has been involved in getting this book published. Before I
finish, I also need to thank my colleague and high school friend Aleš Justin, who
brought me to Red Hat, and my wonderful colleagues from the Cloud Enablement
team. If I hadn’t been at Red Hat or in the team, I wouldn’t have been the one to write
this book.
Lastly, I would like to thank my wife and my son, who were way too understanding
and supportive over the last 18 months, while I was locked in my office instead of
spending time with them.
Thank you all!
about this book
Kubernetes in Action aims to make you a proficient user of Kubernetes. It teaches you
virtually all the concepts you need to understand to effectively develop and run appli-
cations in a Kubernetes environment.
Before diving into Kubernetes, the book gives an overview of container technolo-
gies like Docker, including how to build containers, so that even readers who haven’t
used these technologies before can get up and running. It then slowly guides you
through most of what you need to know about Kubernetes—from basic concepts to
things hidden below the surface.
■ ... explains how pods communicate through the network and how services perform
load balancing across multiple pods.
■ Chapter 12 explains how to secure your Kubernetes API server, and by exten-
sion the cluster, using authentication and authorization.
■ Chapter 13 teaches you how pods can access the node’s resources and how a
cluster administrator can prevent pods from doing that.
■ Chapter 14 dives into constraining the computational resources each applica-
tion is allowed to consume, configuring the applications’ Quality of Service
guarantees, and monitoring the resource usage of individual applications. It
also teaches you how to prevent users from consuming too many resources.
■ Chapter 15 discusses how Kubernetes can be configured to automatically scale
the number of running replicas of your application, and how it can also increase
the size of your cluster when your current number of cluster nodes can’t accept
any additional applications.
■ Chapter 16 shows how to ensure pods are scheduled only to certain nodes or
how to prevent them from being scheduled to others. It also shows how to make
sure pods are scheduled together or how to prevent that from happening.
■ Chapter 17 teaches you how you should develop your applications to make them
good citizens of your cluster. It also gives you a few pointers on how to set up your
development and testing workflows to reduce friction during development.
■ Chapter 18 shows you how you can extend Kubernetes with your own custom
objects and how others have done it and created enterprise-class application
platforms.
As you progress through these chapters, you’ll not only learn about the individual
Kubernetes building blocks, but also progressively improve your knowledge of using
the kubectl command-line tool.
Within text paragraphs, some very common elements such as Pod, Replication-
Controller, ReplicaSet, DaemonSet, and so forth are set in regular font to avoid over-
proliferation of code font and help readability. In some places, “Pod” is capitalized
to refer to the Pod resource, and lowercased to refer to the actual group of running
containers.
All the samples in the book have been tested with Kubernetes version 1.8 running
in Google Kubernetes Engine and in a local cluster run with Minikube. The complete
source code and YAML manifests can be found at https://github.com/luksa/kubernetes-
in-action or downloaded from the publisher’s website at www.manning.com/books/
kubernetes-in-action.
Book forum
Purchase of Kubernetes in Action includes free access to a private web forum run by
Manning Publications where you can make comments about the book, ask technical
questions, and receive help from the author and from other users. To access the
forum, go to https://forums.manning.com/forums/kubernetes-in-action. You can also
learn more about Manning’s forums and the rules of conduct at https://forums
.manning.com/forums/about.
Manning’s commitment to our readers is to provide a venue where a meaningful
dialogue between individual readers and between readers and the author can take
place. It is not a commitment to any specific amount of participation on the part of
the author, whose contribution to the forum remains voluntary (and unpaid). We sug-
gest you try asking the author some challenging questions lest his interest stray! The
forum and the archives of previous discussions will be accessible from the publisher’s
website as long as the book is in print.
about the cover illustration
The figure on the cover of Kubernetes in Action is a “Member of the Divan,” the Turkish
Council of State or governing body. The illustration is taken from a collection of cos-
tumes of the Ottoman Empire published on January 1, 1802, by William Miller of Old
Bond Street, London. The title page is missing from the collection and we have been
unable to track it down to date. The book’s table of contents identifies the figures in
both English and French, and each illustration bears the names of two artists who
worked on it, both of whom would no doubt be surprised to find their art gracing the
front cover of a computer programming book ... 200 years later.
The collection was purchased by a Manning editor at an antiquarian flea market in
the “Garage” on West 26th Street in Manhattan. The seller was an American based in
Ankara, Turkey, and the transaction took place just as he was packing up his stand for
the day. The Manning editor didn’t have on his person the substantial amount of cash
that was required for the purchase, and a credit card and check were both politely
turned down. With the seller flying back to Ankara that evening, the situation was get-
ting hopeless. What was the solution? It turned out to be nothing more than an old-
fashioned verbal agreement sealed with a handshake. The seller proposed that the
money be transferred to him by wire, and the editor walked out with the bank infor-
mation on a piece of paper and the portfolio of images under his arm. Needless to say,
we transferred the funds the next day, and we remain grateful and impressed by this
unknown person’s trust in one of us. It recalls something that might have happened a
long time ago. We at Manning celebrate the inventiveness, the initiative, and, yes, the
fun of the computer business with book covers based on the rich diversity of regional
life of two centuries ago, brought back to life by the pictures from this collection.
Introducing Kubernetes
Years ago, most software applications were big monoliths, running either as a single
process or as a small number of processes spread across a handful of servers. These
legacy systems are still widespread today. They have slow release cycles and are
updated relatively infrequently. At the end of every release cycle, developers pack-
age up the whole system and hand it over to the ops team, who then deploys and
monitors it. In case of hardware failures, the ops team manually migrates it to the
remaining healthy servers.
Today, these big monolithic legacy applications are slowly being broken down
into smaller, independently running components called microservices. Because
microservices are decoupled from each other, they can be developed, deployed, updated,
and scaled individually. This enables you to change components quickly and as often as
necessary to keep up with today’s rapidly changing business requirements.
But with bigger numbers of deployable components and increasingly larger data-
centers, it becomes increasingly difficult to configure, manage, and keep the whole
system running smoothly. It’s much harder to figure out where to put each of those
components to achieve high resource utilization and thereby keep the hardware costs
down. Doing all this manually is hard work. We need automation, which includes
automatic scheduling of those components to our servers, automatic configuration,
supervision, and failure-handling. This is where Kubernetes comes in.
Kubernetes enables developers to deploy their applications themselves and as
often as they want, without requiring any assistance from the operations (ops) team.
But Kubernetes doesn’t benefit only developers. It also helps the ops team by automat-
ically monitoring and rescheduling those apps in the event of a hardware failure. The
focus for system administrators (sysadmins) shifts from supervising individual apps to
mostly supervising and managing Kubernetes and the rest of the infrastructure, while
Kubernetes itself takes care of the apps.
NOTE Kubernetes is Greek for pilot or helmsman (the person holding the
ship’s steering wheel). People pronounce Kubernetes in a few different ways.
Many pronounce it as Koo-ber-nay-tace, while others pronounce it more like
Koo-ber-netties. No matter which form you use, people will understand what
you mean.
Kubernetes abstracts away the hardware infrastructure and exposes your whole data-
center as a single enormous computational resource. It allows you to deploy and run
your software components without having to know about the actual servers under-
neath. When deploying a multi-component application through Kubernetes, it selects
a server for each component, deploys it, and enables it to easily find and communi-
cate with all the other components of your application.
This makes Kubernetes great for most on-premises datacenters, but where it starts
to shine is when it’s used in the largest datacenters, such as the ones built and oper-
ated by cloud providers. Kubernetes allows them to offer developers a simple platform
for deploying and running any type of application, while not requiring the cloud pro-
vider’s own sysadmins to know anything about the tens of thousands of apps running
on their hardware.
With more and more big companies accepting the Kubernetes model as the best
way to run apps, it’s becoming the standard way of running distributed apps both in
the cloud, as well as on local on-premises infrastructure.
and of the changes in the infrastructure that runs those apps. Understanding these
changes will help you better see the benefits of using Kubernetes and container tech-
nologies such as Docker.
DEPLOYING MICROSERVICES
As always, microservices also have drawbacks. When your system consists of only a
small number of deployable components, managing those components is easy. It’s
trivial to decide where to deploy each component, because there aren’t that many
choices. When the number of those components increases, deployment-related deci-
sions become increasingly difficult because not only does the number of deployment
combinations increase, but the number of inter-dependencies between the compo-
nents increases by an even greater factor.
Microservices perform their work together as a team, so they need to find and talk
to each other. When deploying them, someone or something needs to configure all of
them properly to enable them to work together as a single system. With increasing
numbers of microservices, this becomes tedious and error-prone, especially when you
consider what the ops/sysadmin teams need to do when a server fails.
Microservices also bring other problems, such as making it hard to debug and trace
execution calls, because they span multiple processes and machines. Luckily, these
problems are now being addressed with distributed tracing systems such as Zipkin.
UNDERSTANDING THE DIVERGENCE OF ENVIRONMENT REQUIREMENTS
As I’ve already mentioned, components in a microservices architecture aren’t only
deployed independently, but are also developed that way. Because of their indepen-
dence and the fact that it’s common to have separate teams developing each compo-
nent, nothing impedes each team from using different libraries and replacing them
whenever the need arises. The divergence of dependencies between application com-
ponents, like the one shown in figure 1.3, where applications require different ver-
sions of the same libraries, is inevitable.
Figure 1.3 Multiple applications running on the same host may have conflicting dependencies.
application, as shown in figure 1.4. The end-result is that you can fit many more appli-
cations on the same bare-metal machine.
Figure 1.4 Using VMs to isolate groups of applications vs. isolating individual apps with containers
When you run three VMs on a host, you have three completely separate operating sys-
tems running on and sharing the same bare-metal hardware. Underneath those VMs
is the host’s OS and a hypervisor, which divides the physical hardware resources into
smaller sets of virtual resources that can be used by the operating system inside each
VM. Applications running inside those VMs perform system calls to the guest OS’ ker-
nel in the VM, and the kernel then performs x86 instructions on the host’s physical
CPU through the hypervisor.
NOTE Two types of hypervisors exist. Type 1 hypervisors don’t use a host OS,
while Type 2 do.
Containers, on the other hand, all perform system calls on the exact same kernel run-
ning in the host OS. This single kernel is the only one performing x86 instructions on
the host’s CPU. The CPU doesn’t need to do any kind of virtualization the way it does
with VMs (see figure 1.5).
The main benefit of virtual machines is the full isolation they provide, because
each VM runs its own Linux kernel, while containers all call out to the same kernel,
which can clearly pose a security risk. If you have a limited amount of hardware
resources, VMs may only be an option when you have a small number of processes that
you want to isolate. To run greater numbers of isolated processes on the same
machine, containers are a much better choice because of their low overhead. Remem-
ber, each VM runs its own set of system services, while containers don’t, because they
all run in the same OS. That also means that to run a container, nothing needs to be
booted up, as is the case in VMs. A process run in a container starts up immediately.
Each namespace kind is used to isolate a certain group of resources. For example, the
UTS namespace determines what hostname and domain name the process running
inside that namespace sees. By assigning two different UTS namespaces to a pair of
processes, you can make them see different local hostnames. In other words, to the
two processes, it will appear as though they are running on two different machines (at
least as far as the hostname is concerned).
Likewise, what Network namespace a process belongs to determines which net-
work interfaces the application running inside the process sees. Each network inter-
face belongs to exactly one namespace, but can be moved from one namespace to
another. Each container uses its own Network namespace, and therefore each con-
tainer sees its own set of network interfaces.
This should give you a basic idea of how namespaces are used to isolate applica-
tions running in containers from each other.
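A quick way to see this outside of any container runtime (a simplified illustration, assuming a Linux machine with the unshare utility from util-linux and the ip tool available) is to create the namespaces by hand:

$ sudo unshare --uts --net sh -c 'hostname demo-host; hostname; ip link show'

The shell started by unshare sees the hostname demo-host (an arbitrary name used only for this example) and nothing but a loopback network interface, while the host itself keeps its original hostname and interfaces.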
LIMITING RESOURCES AVAILABLE TO A PROCESS
The other half of container isolation deals with limiting the amount of system
resources a container can consume. This is achieved with cgroups, a Linux kernel fea-
ture that limits the resource usage of a process (or a group of processes). A process
can’t use more than the configured amount of CPU, memory, network bandwidth,
and so on. This way, processes cannot hog resources reserved for other processes,
which is similar to when each process runs on a separate machine.
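Docker exposes these cgroup-based limits directly as flags on docker run. For example (the values and container name are illustrative):

$ docker run -d --memory=100m --cpus=0.5 --name limited-app busybox sleep 3600

This constrains the container to 100 MB of memory and half a CPU core; the kernel's cgroups enforce the limits regardless of what the process inside tries to do.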
You’ll notice that apps A and B have access to the same binaries and libraries both
when running in a VM and when running as two separate containers. In the VM, this
is obvious, because both apps see the same filesystem (that of the VM). But we said
that each container has its own isolated filesystem. How can both app A and app B
share the same files?
UNDERSTANDING IMAGE LAYERS
I’ve already said that Docker images are composed of layers. Different images can con-
tain the exact same layers because every Docker image is built on top of another
image and two different images can both use the same parent image as their base.
This speeds up the distribution of images across the network, because layers that have
already been transferred as part of the first image don’t need to be transferred again
when transferring the other image.
But layers don’t only make distribution more efficient, they also help reduce the
storage footprint of images. Each layer is only stored once. Two containers created
from two images based on the same base layers can therefore read the same files, but
if one of them writes over those files, the other one doesn’t see those changes. There-
fore, even if they share files, they’re still isolated from each other. This works because
container image layers are read-only. When a container is run, a new writable layer is
created on top of the layers in the image. When the process in the container writes to
a file located in one of the underlying layers, a copy of the whole file is created in the
top-most layer and the process writes to the copy.
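You can observe this copy-on-write behavior yourself with two containers started from the same image (an illustrative experiment, not from the book; assumes Docker and the node:7 image are available):

$ docker run -d --name first node:7 sleep 3600
$ docker run -d --name second node:7 sleep 3600
$ docker exec first sh -c 'echo "changed" > /etc/os-release'
$ docker exec first cat /etc/os-release       # shows only "changed"
$ docker exec second cat /etc/os-release      # still shows the original file

Both containers read from the same image layers, but the file overwritten in the first container is copied into that container's own writable layer, so the second container keeps seeing the original.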
UNDERSTANDING THE PORTABILITY LIMITATIONS OF CONTAINER IMAGES
In theory, a container image can be run on any Linux machine running Docker, but
one small caveat exists—one related to the fact that all containers running on a host use
the host’s Linux kernel. If a containerized application requires a specific kernel version,
it may not work on every machine. If a machine runs a different version of the Linux
kernel or doesn’t have the same kernel modules available, the app can’t run on it.
While containers are much more lightweight compared to VMs, they impose cer-
tain constraints on the apps running inside them. VMs have no such constraints,
because each VM runs its own kernel.
And it’s not only about the kernel. It should also be clear that a containerized app
built for a specific hardware architecture can only run on other machines that have
the same architecture. You can’t containerize an application built for the x86 architec-
ture and expect it to run on an ARM-based machine because it also runs Docker. You
still need a VM for that.
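You can verify both points with a quick check (assuming Docker is installed): a container reports the host's kernel release and CPU architecture, because it shares them with the host.

$ docker run --rm busybox uname -r     # same kernel release as the host
$ docker run --rm busybox uname -m     # same architecture (for example x86_64)
$ uname -r && uname -m                 # compare with the host's own values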
Like Docker, rkt is a platform for running containers. It puts a strong emphasis on
security, composability, and conforming to open standards. It uses the OCI container
image format and can even run regular Docker container images.
This book focuses on using Docker as the container runtime for Kubernetes,
because it was initially the only one supported by Kubernetes. Recently, Kubernetes
has also started supporting rkt, as well as others, as the container runtime.
The reason I mention rkt at this point is so you don’t make the mistake of thinking
Kubernetes is a container orchestration system made specifically for Docker-based
containers. In fact, over the course of this book, you’ll realize that the essence of
Kubernetes isn’t orchestrating containers. It’s much more. Containers happen to be
the best way to run apps on different cluster nodes. With that in mind, let’s finally dive
into the core of what this book is all about—Kubernetes.
same server, which is critical when you run applications for completely different orga-
nizations on the same hardware. This is of paramount importance for cloud provid-
ers, because they strive for the best possible utilization of their hardware while still
having to maintain complete isolation of hosted applications.
Kubernetes enables you to run your software applications on thousands of com-
puter nodes as if all those nodes were a single, enormous computer. It abstracts away
the underlying infrastructure and, by doing so, simplifies development, deployment,
and management for both development and the operations teams.
Deploying applications through Kubernetes is always the same, whether your clus-
ter contains only a couple of nodes or thousands of them. The size of the cluster
makes no difference at all. Additional cluster nodes simply represent an additional
amount of resources available to deployed apps.
UNDERSTANDING THE CORE OF WHAT KUBERNETES DOES
Figure 1.8 shows the simplest possible view of a Kubernetes system. The system is com-
posed of a master node and any number of worker nodes. When the developer sub-
mits a list of apps to the master, Kubernetes deploys them to the cluster of worker
nodes. What node a component lands on doesn’t (and shouldn’t) matter—neither to
the developer nor to the system administrator.
Figure 1.8 Kubernetes exposes the whole datacenter as a single deployment platform.
The developer can specify that certain apps must run together and Kubernetes will
deploy them on the same worker node. Others will be spread around the cluster, but
they can talk to each other in the same way, regardless of where they’re deployed.
HELPING DEVELOPERS FOCUS ON THE CORE APP FEATURES
Kubernetes can be thought of as an operating system for the cluster. It relieves appli-
cation developers from having to implement certain infrastructure-related services
into their apps; instead they rely on Kubernetes to provide these services. This includes
things such as service discovery, scaling, load-balancing, self-healing, and even leader
election. Application developers can therefore focus on implementing the actual fea-
tures of the applications and not waste time figuring out how to integrate them with
the infrastructure.
HELPING OPS TEAMS ACHIEVE BETTER RESOURCE UTILIZATION
Kubernetes will run your containerized app somewhere in the cluster, provide infor-
mation to its components on how to find each other, and keep all of them running.
Because your application doesn’t care which node it’s running on, Kubernetes can
relocate the app at any time, and by mixing and matching apps, achieve far better
resource utilization than is possible with manual scheduling.
Figure 1.9 shows the components running on these two sets of nodes. I’ll explain
them next.
■ The Scheduler, which schedules your apps (assigns a worker node to each deployable component of your application)
■ The Controller Manager, which performs cluster-level functions, such as replicating components, keeping track of worker nodes, handling node failures, and so on
■ etcd, a reliable distributed data store that persistently stores the cluster configuration
The components of the Control Plane hold and control the state of the cluster, but
they don’t run your applications. This is done by the (worker) nodes.
THE NODES
The worker nodes are the machines that run your containerized applications. The
task of running, monitoring, and providing services to your applications is done by
the following components:
■ Docker, rkt, or another container runtime, which runs your containers
■ The Kubelet, which talks to the API server and manages containers on its node
■ The Kubernetes Service Proxy (kube-proxy), which load-balances network traffic between application components
We’ll explain all these components in detail in chapter 11. I’m not a fan of explaining
how things work before first explaining what something does and teaching people to
use it. It’s like learning to drive a car. You don’t want to know what’s under the hood.
You first want to learn how to drive it from point A to point B. Only after you learn
how to do that do you become interested in how a car makes that possible. After all,
knowing what’s under the hood may someday help you get the car moving again after
it breaks down and leaves you stranded at the side of the road.
at that moment. The Kubelet on those nodes then instructs the Container Runtime
(Docker, for example) to pull the required container images and run the containers.
Examine figure 1.10 to gain a better understanding of how applications are
deployed in Kubernetes. The app descriptor lists four containers, grouped into three
sets (these sets are called pods; we’ll explain what they are in chapter 3). The first two
pods each contain only a single container, whereas the last one contains two. That
means both containers need to run co-located and shouldn’t be isolated from each
other. Next to each pod, you also see a number representing the number of replicas
of each pod that need to run in parallel. After submitting the descriptor to Kuberne-
tes, it will schedule the specified number of replicas of each pod to the available
worker nodes. The Kubelets on the nodes will then tell Docker to pull the container
images from the image registry and run the containers.
Figure 1.10 A basic overview of the Kubernetes architecture and an application running on top of it
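Such a descriptor is simply a YAML (or JSON) manifest. As a rough sketch of what the description of one of those pods with five replicas could look like on the Kubernetes 1.8 API used throughout the book (the resource name and image are illustrative, and the manifest format is only introduced properly in later chapters; newer clusters would use apps/v1):

$ kubectl create -f - <<EOF
apiVersion: apps/v1beta2
kind: ReplicaSet
metadata:
  name: frontend
spec:
  replicas: 5                       # run five replicas of this pod
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: example/frontend:1.0   # hypothetical container image
EOF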
you specify that you always want five instances of a web server running, Kubernetes will
always keep exactly five instances running. If one of those instances stops working
properly, like when its process crashes or when it stops responding, Kubernetes will
restart it automatically.
Similarly, if a whole worker node dies or becomes inaccessible, Kubernetes will
select new nodes for all the containers that were running on the node and run them
on the newly selected nodes.
SCALING THE NUMBER OF COPIES
While the application is running, you can decide you want to increase or decrease the
number of copies, and Kubernetes will spin up additional ones or stop the excess
ones, respectively. You can even leave the job of deciding the optimal number of cop-
ies to Kubernetes. It can automatically keep adjusting the number, based on real-time
metrics, such as CPU load, memory consumption, queries per second, or any other
metric your app exposes.
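In kubectl terms, manual and automatic scaling look roughly like this (resource names are illustrative; the details are the subject of later chapters):

$ kubectl scale rs frontend --replicas=10
$ kubectl autoscale rs frontend --min=3 --max=10 --cpu-percent=80

The first command sets the replica count by hand; the second tells Kubernetes to keep adjusting it between the given bounds based on CPU usage.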
HITTING A MOVING TARGET
We’ve said that Kubernetes may need to move your containers around the cluster.
This can occur when the node they were running on has failed or because they were
evicted from a node to make room for other containers. If the container is providing a
service to external clients or other containers running in the cluster, how can they use
the container properly if it’s constantly moving around the cluster? And how can cli-
ents connect to containers providing a service when those containers are replicated
and spread across the whole cluster?
To allow clients to easily find containers that provide a specific service, you can tell
Kubernetes which containers provide the same service and Kubernetes will expose all
of them at a single static IP address and expose that address to all applications run-
ning in the cluster. This is done through environment variables, but clients can also
look up the service IP through good old DNS. The kube-proxy will make sure connec-
tions to the service are load balanced across all the containers that provide the service.
The IP address of the service stays constant, so clients can always connect to its con-
tainers, even when they’re moved around the cluster.
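A sketch of how that looks in practice (names are illustrative; services are covered properly in chapter 5):

$ kubectl expose rs frontend --name=frontend-svc --port=80 --target-port=8080
$ kubectl get service frontend-svc      # shows the stable cluster-internal IP

Other pods in the cluster can then reach the whole group of containers at that IP, or simply through the DNS name frontend-svc, regardless of which nodes the individual pods land on.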
In essence, all the nodes are now a single bunch of computational resources that
are waiting for applications to consume them. A developer doesn’t usually care what
kind of server the application is running on, as long as the server can provide the
application with adequate system resources.
Certain cases do exist where the developer does care what kind of hardware the
application should run on. If the nodes are heterogeneous, you’ll find cases when you
want certain apps to run on nodes with certain capabilities and run other apps on oth-
ers. For example, one of your apps may require being run on a system with SSDs
instead of HDDs, while other apps run fine on HDDs. In such cases, you obviously
want to ensure that particular app is always scheduled to a node with an SSD.
Without using Kubernetes, the sysadmin would select one specific node that has an
SSD and deploy the app there. But when using Kubernetes, instead of selecting a spe-
cific node where your app should be run, it’s more appropriate to tell Kubernetes to
only choose among nodes with an SSD. You’ll learn how to do that in chapter 3.
ACHIEVING BETTER UTILIZATION OF HARDWARE
By setting up Kubernetes on your servers and using it to run your apps instead of run-
ning them manually, you’ve decoupled your app from the infrastructure. When you
tell Kubernetes to run your application, you’re letting it choose the most appropriate
node to run your application on based on the description of the application’s
resource requirements and the available resources on each node.
By using containers and not tying the app down to a specific node in your cluster,
you’re allowing the app to freely move around the cluster at any time, so the different
app components running on the cluster can be mixed and matched to be packed
tightly onto the cluster nodes. This ensures the node’s hardware resources are utilized
as best as possible.
The ability to move applications around the cluster at any time allows Kubernetes
to utilize the infrastructure much better than what you can achieve manually. Humans
aren’t good at finding optimal combinations, especially when the number of all possi-
ble options is huge, such as when you have many application components and many
server nodes they can be deployed on. Computers can obviously perform this work
much better and faster than humans.
HEALTH CHECKING AND SELF-HEALING
Having a system that allows moving an application across the cluster at any time is also
valuable in the event of server failures. As your cluster size increases, you’ll deal with
failing computer components ever more frequently.
Kubernetes monitors your app components and the nodes they run on and auto-
matically reschedules them to other nodes in the event of a node failure. This frees
the ops team from having to migrate app components manually and allows the team
to immediately focus on fixing the node itself and returning it to the pool of available
hardware resources instead of focusing on relocating the app.
If your infrastructure has enough spare resources to allow normal system opera-
tion even without the failed node, the ops team doesn’t even need to react to the failure
immediately, such as at 3 a.m. They can sleep tight and deal with the failed node
during regular work hours.
AUTOMATIC SCALING
Using Kubernetes to manage your deployed applications also means the ops team
doesn’t need to constantly monitor the load of individual applications to react to sud-
den load spikes. As previously mentioned, Kubernetes can be told to monitor the
resources used by each application and to keep adjusting the number of running
instances of each application.
If Kubernetes is running on cloud infrastructure, where adding additional nodes is
as easy as requesting them through the cloud provider’s API, Kubernetes can even
automatically scale the whole cluster size up or down based on the needs of the
deployed applications.
SIMPLIFYING APPLICATION DEVELOPMENT
The features described in the previous section mostly benefit the operations team. But
what about the developers? Does Kubernetes bring anything to their table? It defi-
nitely does.
If you turn back to the fact that apps run in the same environment both during
development and in production, this has a big effect on when bugs are discovered. We
all agree the sooner you discover a bug, the easier it is to fix it, and fixing it requires
less work. It’s the developers who do the fixing, so this means less work for them.
Then there’s the fact that developers don’t need to implement features that they
would usually implement. This includes discovery of services and/or peers in a clustered
application. Kubernetes does this instead of the app. Usually, the app only needs to look
up certain environment variables or perform a DNS lookup. If that’s not enough, the
application can query the Kubernetes API server directly to get that and/or other infor-
mation. Querying the Kubernetes API server like that can even save developers from
having to implement complicated mechanisms such as leader election.
As a final example of what Kubernetes brings to the table, you also need to con-
sider the increase in confidence developers will feel knowing that when a new version
of their app is going to be rolled out, Kubernetes can automatically detect if the new
version is bad and stop its rollout immediately. This increase in confidence usually
accelerates the continuous delivery of apps, which benefits the whole organization.
1.4 Summary
In this introductory chapter, you’ve seen how applications have changed in recent
years and how they can now be harder to deploy and manage. We’ve introduced
Kubernetes and shown how it, together with Docker and other container platforms,
helps deploy and manage applications and the infrastructure they run on. You’ve
learned that
■ Monolithic apps are easier to deploy, but harder to maintain over time and sometimes impossible to scale.
Before you start learning about Kubernetes concepts in detail, let’s see how to cre-
ate a simple application, package it into a container image, and run it in a managed
Kubernetes cluster (in Google Kubernetes Engine) or in a local single-node cluster.
This should give you a slightly better overview of the whole Kubernetes system and
will make it easier to follow the next few chapters, where we’ll go over the basic
building blocks and concepts in Kubernetes.
First steps with Docker and Kubernetes
This doesn’t look that impressive, but when you consider that the whole “app” was
downloaded and executed with a single command, without you having to install that
app or anything else, you’ll agree it’s awesome. In your case, the app was a single execut-
able (busybox), but it might as well have been an incredibly complex app with many
dependencies. The whole process of setting up and running the app would have been
exactly the same. What’s also important is that the app was executed inside a container,
completely isolated from all the other processes running on your machine.
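For reference, the command that produced all of this (its listing is not preserved in this copy) is the one shown in figure 2.1:

$ docker run busybox echo "Hello world"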
UNDERSTANDING WHAT HAPPENS BEHIND THE SCENES
Figure 2.1 shows exactly what happened when you performed the docker run com-
mand. First, Docker checked to see if the busybox:latest image was already present
on your local machine. It wasn’t, so Docker pulled it from the Docker Hub registry at
http://docker.io. After the image was downloaded to your machine, Docker created a
container from that image and ran the command inside it. The echo command
printed the text to STDOUT and then the process terminated and the container
stopped.
Figure 2.1 Running echo "Hello world" in a container based on the busybox container image
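The listing with the trivial Node.js app is missing from this copy; the following is a minimal reconstruction, consistent with the description that follows but not necessarily the author's exact code, written to app.js from the shell:

$ cat > app.js <<'EOF'
const http = require('http');
const os = require('os');

console.log("Kubia server starting...");

var handler = function(request, response) {
  console.log("Received request from " + request.connection.remoteAddress);
  response.writeHead(200);
  response.end("You've hit " + os.hostname() + "\n");
};

var www = http.createServer(handler);
www.listen(8080);
EOF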
It should be clear what this code does. It starts up an HTTP server on port 8080. The
server responds with an HTTP response status code 200 OK and the text "You’ve hit
<hostname>" to every request. The request handler also logs the client’s IP address to
the standard output, which you’ll need later.
NOTE The returned hostname is the server’s actual hostname, not the one
the client sends in the HTTP request’s Host header.
You could now download and install Node.js and test your app directly, but this isn’t
necessary, because you’ll use Docker to package the app into a container image and
Listing 2.3 A Dockerfile for building a container image for your app
FROM node:7
ADD app.js /app.js
ENTRYPOINT ["node", "app.js"]
The FROM line defines the container image you’ll use as a starting point (the base
image you’re building on top of). In your case, you’re using the node container image,
tag 7. In the second line, you’re adding your app.js file from your local directory into
the root directory in the image, under the same name (app.js). Finally, in the third
line, you’re defining what command should be executed when somebody runs the
image. In your case, the command is node app.js.
Figure 2.2 shows what happens during the build process. You’re telling Docker to
build an image called kubia based on the contents of the current directory (note the
dot at the end of the build command). Docker will look for the Dockerfile in the direc-
tory and build the image based on the instructions in the file.
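The build command itself (not preserved in this copy) is presumably the one shown in figure 2.2:

$ docker build -t kubia .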
TIP Don’t include any unnecessary files in the build directory, because they’ll
slow down the build process—especially when the Docker daemon is on a
remote machine.
During the build process, Docker will first pull the base image (node:7) from the pub-
lic image repository (Docker Hub), unless the image has already been pulled and is
stored on your machine.
UNDERSTANDING IMAGE LAYERS
An image isn’t a single, big, binary blob, but is composed of multiple layers, which you
may have already noticed when running the busybox example (there were multiple
Pull complete lines—one for each layer). Different images may share several layers,
which makes storing and transferring images much more efficient. For example, if
you create multiple images based on the same base image (such as node:7 in the exam-
ple), all the layers comprising the base image will be stored only once. Also, when pull-
ing an image, Docker will download each layer individually. Several layers may already
be stored on your machine, so Docker will only download those that aren’t.
You may think that each Dockerfile creates only a single new layer, but that’s not
the case. When building an image, a new layer is created for each individual command
in the Dockerfile. During the build of your image, after pulling all the layers of the base
image, Docker will create a new layer on top of them and add the app.js file into it.
Then it will create yet another layer that will specify the command that should be exe-
cuted when the image is run. This last layer will then be tagged as kubia:latest. This is
shown in figure 2.3, which also shows how a different image called other:latest would
use the same layers of the Node.js image as your own image does.
Figure 2.3 Container images are composed of layers that can be shared among different images.
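You can inspect those layers yourself once the image is built, for example with:

$ docker history kubia

This lists each layer of the kubia image, with the bottom layers identical to those of the node:7 base image it was built from.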
When the build process completes, you have a new image stored locally. You can see it
by telling Docker to list all locally stored images, as shown in the following listing.
$ docker images
REPOSITORY TAG IMAGE ID CREATED VIRTUAL SIZE
kubia latest d30ecc7419e7 1 minute ago 637.1 MB
...
the Dockerfile and rebuild the image any time, without having to manually retype all
the commands again.
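The run command being explained next (its listing is missing here) is presumably:

$ docker run --name kubia-container -p 8080:8080 -d kubia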
This tells Docker to run a new container called kubia-container from the kubia
image. The container will be detached from the console (-d flag), which means it will
run in the background. Port 8080 on the local machine will be mapped to port 8080
inside the container (-p 8080:8080 option), so you can access the app through
http://localhost:8080.
If you’re not running the Docker daemon on your local machine (if you’re using a
Mac or Windows, the daemon is running inside a VM), you’ll need to use the host-
name or IP of the VM running the daemon instead of localhost. You can look it up
through the DOCKER_HOST environment variable.
ACCESSING YOUR APP
Now try to access your application at http://localhost:8080 (be sure to replace local-
host with the hostname or IP of the Docker host if necessary):
$ curl localhost:8080
You’ve hit 44d76963e8e1
That’s the response from your app. Your tiny application is now running inside a con-
tainer, isolated from everything else. As you can see, it’s returning 44d76963e8e1 as its
hostname, and not the actual hostname of your host machine. The hexadecimal num-
ber is the ID of the Docker container.
LISTING ALL RUNNING CONTAINERS
Let’s list all running containers in the following listing, so you can examine the list
(I’ve edited the output to make it more readable—imagine the last two lines as the
continuation of the first two).
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED ...
44d76963e8e1 kubia:latest "/bin/sh -c 'node ap 6 minutes ago ...
A single container is running. For each container, Docker prints out its ID and name,
the image used to run the container, and the command that’s executing inside the
container.
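To get more detail than docker ps shows, you can inspect the container; the command the next sentence describes is presumably:

$ docker inspect kubia-container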
Docker will print out a long JSON containing low-level information about the con-
tainer.
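The command described next is presumably:

$ docker exec -it kubia-container bash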
This will run bash inside the existing kubia-container container. The bash process
will have the same Linux namespaces as the main container process. This allows you
to explore the container from within and see how Node.js and your app see the system
when running inside the container. The -it option is shorthand for two options:
-i, which makes sure STDIN is kept open. You need this for entering com-
mands into the shell.
-t, which allocates a pseudo terminal (TTY).
You need both if you want to use the shell like you’re used to. (If you leave out the
first one, you can’t type any commands, and if you leave out the second one, the com-
mand prompt won’t be displayed and some commands will complain about the TERM
variable not being set.)
EXPLORING THE CONTAINER FROM WITHIN
Let’s see how to use the shell in the following listing to see the processes running in
the container.
root@44d76963e8e1:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 676380 16504 ? Sl 12:31 0:00 node app.js
root 10 0.0 0.0 20216 1924 ? Ss 12:31 0:00 bash
root 19 0.0 0.0 17492 1136 ? R+ 12:38 0:00 ps aux
You see only three processes. You don’t see any other processes from the host OS.
NOTE If you’re using a Mac or Windows, you’ll need to log into the VM where
the Docker daemon is running to see these processes.
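If you then list the processes on the host itself (or in the Docker VM), the same node process shows up there, for example with:

$ ps aux | grep app.js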
This proves that processes running in the container are running in the host OS. If you
have a keen eye, you may have noticed that the processes have different IDs inside the
container vs. on the host. The container is using its own PID Linux namespace and
has a completely isolated process tree, with its own sequence of numbers.
THE CONTAINER’S FILESYSTEM IS ALSO ISOLATED
Like having an isolated process tree, each container also has an isolated filesystem.
Listing the contents of the root directory inside the container will only show the files
in the container and will include all the files that are in the image plus any files that
are created while the container is running (log files and similar), as shown in the fol-
lowing listing.
root@44d76963e8e1:/# ls /
app.js boot etc lib media opt root sbin sys usr
bin dev home lib64 mnt proc run srv tmp var
It contains the app.js file and other system directories that are part of the node:7 base
image you’re using. To exit the container, you exit the shell by running the exit com-
mand and you’ll be returned to your host machine (like logging out of an ssh session,
for example).
TIP Entering a running container like this is useful when debugging an app
running in a container. When something’s wrong, the first thing you’ll want
to explore is the actual state of the system your application sees. Keep in mind
that an application will not only see its own unique filesystem, but also pro-
cesses, users, hostname, and network interfaces.
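The stop command being described next is presumably:

$ docker stop kubia-container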
This will stop the main process running in the container and consequently stop the
container, because no other processes are running inside the container. The con-
tainer itself still exists and you can see it with docker ps -a. The -a option prints out
all the containers, those running and those that have been stopped. To truly remove a
container, you need to remove it with the docker rm command:
$ docker rm kubia-container
This deletes the container. All its contents are removed and it can’t be started again.
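Before pushing the image to Docker Hub, it has to be tagged under your Docker Hub ID (luksa in the book's examples); the command being explained next is presumably:

$ docker tag kubia luksa/kubia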
This doesn’t rename the tag; it creates an additional tag for the same image. You can
confirm this by listing the images stored on your system with the docker images com-
mand, as shown in the following listing.
As you can see, both kubia and luksa/kubia point to the same image ID, so they’re in
fact one single image with two tags.
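The steps this refers to, pushing the image and then running it on any other Docker-enabled machine (after logging in with docker login), would look roughly like this:

$ docker push luksa/kubia
$ docker run -p 8080:8080 -d luksa/kubia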
It doesn’t get much simpler than that. And the best thing about this is that your appli-
cation will have the exact same environment every time and everywhere it’s run. If it
ran fine on your machine, it should run as well on every other Linux machine. No
need to worry about whether the host machine has Node.js installed or not. In fact,
even if it does, your app won’t use it, because it will use the one installed inside the
image.
cluster using virtual machines, but I suggest you try it only after reading the first 11
chapters of the book.
Another option is to install Kubernetes on Amazon’s AWS (Amazon Web Services).
For this, you can look at the kops tool, which is built on top of kubeadm mentioned in
the previous paragraph, and is available at http://github.com/kubernetes/kops. It
helps you deploy production-grade, highly available Kubernetes clusters on AWS and
will eventually support other platforms as well (Google Kubernetes Engine, VMware,
vSphere, and so on).
On Linux, you download a different release (replace “darwin” with “linux” in the
URL). On Windows, you can download the file manually, rename it to minikube.exe,
and put it onto your path. Minikube runs Kubernetes inside a VM run through either
VirtualBox or KVM, so you also need to install one of them before you can start the
Minikube cluster.
STARTING A KUBERNETES CLUSTER WITH MINIKUBE
Once you have Minikube installed locally, you can immediately start up the Kuberne-
tes cluster with the command in the following listing.
$ minikube start
Starting local Kubernetes cluster...
Starting VM...
SSH-ing files into VM...
...
Kubectl is now configured to use the cluster.
Starting the cluster takes more than a minute, so don’t interrupt the command before
it completes.
INSTALLING THE KUBERNETES CLIENT (KUBECTL)
To interact with Kubernetes, you also need the kubectl CLI client. Again, all you need
to do is download it and put it on your path. The latest stable release for OSX, for
example, can be downloaded and installed with the following command:
$ curl -LO https://storage.googleapis.com/kubernetes-release/release
➥ /$(curl -s https://storage.googleapis.com/kubernetes-release/release
➥ /stable.txt)/bin/darwin/amd64/kubectl
➥ && chmod +x kubectl
➥ && sudo mv kubectl /usr/local/bin/
To download kubectl for Linux or Windows, replace darwin in the URL with either
linux or windows.
$ kubectl cluster-info
Kubernetes master is running at https://192.168.99.100:8443
KubeDNS is running at https://192.168.99.100:8443/api/v1/proxy/...
kubernetes-dashboard is running at https://192.168.99.100:8443/api/v1/...
This shows the cluster is up. It shows the URLs of the various Kubernetes components,
including the API server and the web console.
TIP You can run minikube ssh to log into the Minikube VM and explore it
from the inside. For example, you may want to see what processes are run-
ning on the node.
SETTING UP A GOOGLE CLOUD PROJECT AND DOWNLOADING THE NECESSARY CLIENT BINARIES
Before you can set up a new Kubernetes cluster, you need to set up your GKE environ-
ment. Because the process may change, I’m not listing the exact instructions here. To
get started, please follow the instructions at https://cloud.google.com/container-
engine/docs/before-you-begin.
Roughly, the whole procedure includes
1 Signing up for a Google account, in the unlikely case you don’t have one
already.
2 Creating a project in the Google Cloud Platform Console.
3 Enabling billing. This does require your credit card info, but Google provides a
12-month free trial. (And they’re nice enough to not start charging automatically after the free trial is over.)
4 Enabling the Kubernetes Engine API.
5 Downloading and installing Google Cloud SDK. (This includes the gcloud
command-line tool, which you’ll need to create a Kubernetes cluster.)
6 Installing the kubectl command-line tool with gcloud components install
kubectl.
NOTE Certain operations (the one in step 2, for example) may take a few
minutes to complete, so relax and grab a coffee in the meantime.
CREATING A KUBERNETES CLUSTER WITH THREE NODES
After completing the installation, you can create a Kubernetes cluster with three
worker nodes using the command shown in the following listing.
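The cluster-creation listing isn’t reproduced here; a representative gcloud invocation (a sketch; the cluster name kubia and the three-node size follow from the text, while the machine type is an assumption) looks like this:

$ gcloud container clusters create kubia --num-nodes 3 --machine-type f1-micro
Creating cluster kubia...done.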
You should now have a running Kubernetes cluster with three worker nodes as shown
in figure 2.4. You’re using three nodes to help better demonstrate features that apply
to multiple nodes. You can use a smaller number of nodes, if you want.
GETTING AN OVERVIEW OF YOUR CLUSTER
To give you a basic idea of what your cluster looks like and how to interact with it, see
figure 2.4. Each node runs Docker, the Kubelet and the kube-proxy. You’ll interact
with the cluster through the kubectl command line client, which issues REST requests
to the Kubernetes API server running on the master node.
Figure 2.4 How you’re interacting with your three-node Kubernetes cluster: kubectl issues REST calls to the API server on the master node, while each worker node (such as gke-kubia-85f6-node-0rrx and gke-kubia-85f6-node-vs9f) runs Docker, the Kubelet, and kube-proxy.
The kubectl get command can list all kinds of Kubernetes objects. You’ll use it con-
stantly, but it usually shows only the most basic information for the listed objects.
TIP You can log into one of the nodes with gcloud compute ssh <node-name>
to explore what’s running on the node.
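A sketch of the two commands the surrounding text refers to, listing the worker nodes and describing one of them (statuses and ages are illustrative; the node names match the GKE node names used elsewhere in the chapter):

$ kubectl get nodes
NAME                       STATUS    AGE
gke-kubia-85f6-node-0rrx   Ready     1h
gke-kubia-85f6-node-heo1   Ready     1h
gke-kubia-85f6-node-vs9f   Ready     1h

$ kubectl describe node gke-kubia-85f6-node-0rrx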
I’m omitting the actual output of the describe command, because it’s fairly wide and
would be completely unreadable here in the book. The output shows the node’s sta-
tus, its CPU and memory data, system information, containers running on the node,
and much more.
In the previous kubectl describe example, you specified the name of the node
explicitly, but you could also have performed a simple kubectl describe node without
typing the node’s name and it would print out a detailed description of all the nodes.
TIP Running the describe and get commands without specifying the name
of the object comes in handy when only one object of a given type exists, so
you don’t waste time typing or copy/pasting the object’s name.
While we’re talking about reducing keystrokes, let me give you additional advice on
how to make working with kubectl much easier, before we move on to running your
first app in Kubernetes. One simple trick is to define a short alias for the kubectl command by adding the following line to your ~/.bashrc or equivalent file:
alias k=kubectl
NOTE You may already have the k executable if you used gcloud to set up the
cluster.
CONFIGURING TAB COMPLETION FOR KUBECTL
Even with a short alias such as k, you’ll still need to type way more than you’d like. Luckily, the kubectl command can also output shell completion code for both the bash and zsh shells. It enables tab completion not only of command names, but also of the actual object names. For example, instead of having to write the whole node name in the previous example, you’d only need to type the first few characters and press TAB to have the shell complete the rest.
To enable tab completion in bash, you’ll first need to install a package called bash-
completion and then run the following command (you’ll probably also want to add it
to ~/.bashrc or equivalent):
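A minimal sketch of the command, using kubectl’s built-in completion generator:

$ source <(kubectl completion bash)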
But there’s one caveat. When you run the preceding command, tab completion will
only work when you use the full kubectl name (it won’t work when you use the k
alias). To fix this, you need to transform the output of the kubectl completion com-
mand a bit:
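A common way to do this (a sketch) is to replace the kubectl command name with the alias in the generated completion code before sourcing it:

$ source <(kubectl completion bash | sed s/kubectl/k/g)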
NOTE Unfortunately, as I’m writing this, shell completion doesn’t work for
aliases on MacOS. You’ll have to use the full kubectl command name if you
want completion to work.
Now you’re all set to start interacting with your cluster without having to type too
much. You can finally run your first app on Kubernetes.
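The command referenced below looks roughly like this; the flags are the ones explained in the next paragraph, and the run/v1 generator is the value that makes kubectl create a ReplicationController:

$ kubectl run kubia --image=luksa/kubia --port=8080 --generator=run/v1
replicationcontroller "kubia" created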
The --image=luksa/kubia part obviously specifies the container image you want to
run, and the --port=8080 option tells Kubernetes that your app is listening on port
8080. The last flag (--generator) does require an explanation, though. Usually, you
won’t use it, but you’re using it here so Kubernetes creates a ReplicationController
instead of a Deployment. You’ll learn what ReplicationControllers are later in the chap-
ter, but we won’t talk about Deployments until chapter 9. That’s why I don’t want
kubectl to create a Deployment yet.
As the previous command’s output shows, a ReplicationController called kubia
has been created. As already mentioned, we’ll see what that is later in the chapter. For
now, let’s start from the bottom and focus on the container you created (you can
assume a container has been created, because you specified a container image in the
run command).
INTRODUCING PODS
You may be wondering if you can see your container in a list showing all the running
containers. Maybe something such as kubectl get containers? Well, that’s not exactly
how Kubernetes works. It doesn’t deal with individual containers directly. Instead, it
uses the concept of multiple co-located containers. This group of containers is called
a Pod.
A pod is a group of one or more tightly related containers that will always run
together on the same worker node and in the same Linux namespace(s). Each pod
is like a separate logical machine with its own IP, hostname, processes, and so on,
running a single application. The application can be a single process, running in a
single container, or it can be a main application process and additional supporting
processes, each running in its own container. All the containers in a pod will appear
to be running on the same logical machine, whereas containers in other pods, even
if they’re running on the same worker node, will appear to be running on a differ-
ent one.
To better understand the relationship between containers, pods, and nodes, exam-
ine figure 2.5. As you can see, each pod has its own IP and contains one or more con-
tainers, each running an application process. Pods are spread out across different
worker nodes.
Figure 2.5 The relationship between containers, pods, and physical worker nodes
LISTING PODS
Because you can’t list individual containers, since they’re not standalone Kubernetes
objects, can you list pods instead? Yes, you can. Let’s see how to tell kubectl to list
pods in the following listing.
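A sketch of the command and its initial output (the pod name is the generated one referenced later in the chapter; the age is illustrative):

$ kubectl get pods
NAME          READY     STATUS    RESTARTS   AGE
kubia-4jfyf   0/1       Pending   0          1m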
This is your pod. Its status is still Pending and the pod’s single container is shown as
not ready yet (this is what the 0/1 in the READY column means). The reason why the
pod isn’t running yet is because the worker node the pod has been assigned to is
downloading the container image before it can run it. When the download is finished,
the pod’s container will be created and then the pod will transition to the Running
state, as shown in the following listing.
Listing 2.15 Listing pods again to see if the pod’s status has changed
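A representative output once the image has been pulled (a sketch; the age is illustrative):

$ kubectl get pods
NAME          READY     STATUS    RESTARTS   AGE
kubia-4jfyf   1/1       Running   0          5m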
To see more information about the pod, you can also use the kubectl describe pod
command, like you did earlier for one of the worker nodes. If the pod stays stuck in
the Pending status, it might be that Kubernetes can’t pull the image from the registry.
If you’re using your own image, make sure it’s marked as public on Docker Hub. To
make sure the image can be pulled successfully, try pulling the image manually with
the docker pull command on another machine.
UNDERSTANDING WHAT HAPPENED BEHIND THE SCENES
To help you visualize what transpired, look at figure 2.6. It shows both steps you had to
perform to get a container image running inside Kubernetes. First, you built the
image and pushed it to Docker Hub. This was necessary because building the image
on your local machine only makes it available on your local machine, but you needed
to make it accessible to the Docker daemons running on your worker nodes.
When you ran the kubectl command, it created a new ReplicationController
object in the cluster by sending a REST HTTP request to the Kubernetes API server.
The ReplicationController then created a new pod, which was then scheduled to one
of the worker nodes by the Scheduler. The Kubelet on that node saw that the pod was
scheduled to it and instructed Docker to pull the specified image from the registry
because the image wasn’t available locally. After downloading the image, Docker cre-
ated and ran the container.
The other two nodes are displayed to show context. They didn’t play any role in
the process, because the pod wasn’t scheduled to them.
DEFINITION The term scheduling means assigning the pod to a node. The
pod is run immediately, not at a time in the future as the term might lead you
to believe.
Figure 2.6 Pushing the luksa/kubia image from the local dev machine to Docker Hub (1. docker push luksa/kubia; 2. the image is pushed to Docker Hub) and running it on the worker nodes (gke-kubia-85f6-node-0rrx, gke-kubia-85f6-node-heo1) under the control of the master node(s).
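The service discussed next was created with the kubectl expose command, which isn’t reproduced above. A representative invocation (a sketch inferred from the text below, which mentions a load-balancer-backed service named kubia-http in front of the kubia ReplicationController) is:

$ kubectl expose rc kubia --type=LoadBalancer --name kubia-http
service "kubia-http" exposed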
LISTING SERVICES
The expose command’s output mentions a service called kubia-http. Services are
objects like Pods and Nodes, so you can see the newly created Service object by run-
ning the kubectl get services command, as shown in the following listing.
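A sketch of the command and its output at this point (the ClusterIP of the kubernetes service, the node port, and the ages are illustrative; the kubia-http internal IP matches the one shown in figure 2.8):

$ kubectl get services
NAME         CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
kubernetes   10.3.240.1     <none>        443/TCP          34m
kubia-http   10.3.246.185   <pending>     8080:31348/TCP   4s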
The list shows two services. Ignore the kubernetes service for now and take a close
look at the kubia-http service you created. It doesn’t have an external IP address yet,
because it takes time for the load balancer to be created by the cloud infrastructure
Kubernetes is running on. Once the load balancer is up, the external IP address of the
service should be displayed. Let’s wait a while and list the services again, as shown in
the following listing.
Listing 2.17 Listing services again to see if an external IP has been assigned
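A sketch of the second listing, this time with the external IP assigned (the other values are illustrative, as above):

$ kubectl get svc
NAME         CLUSTER-IP     EXTERNAL-IP     PORT(S)          AGE
kubernetes   10.3.240.1     <none>          443/TCP          35m
kubia-http   10.3.246.185   104.155.74.57   8080:31348/TCP   1m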
Aha, there’s the external IP. Your application is now accessible at http://104.155.74
.57:8080 from anywhere in the world.
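You can now send requests to your app with curl (a sketch; the pod name in the response is whichever pod handled the request):

$ curl 104.155.74.57:8080
You’ve hit kubia-4jfyf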
Woohoo! Your app is now running somewhere in your three-node Kubernetes cluster
(or a single-node cluster if you’re using Minikube). If you don’t count the steps
required to set up the whole cluster, all it took was two simple commands to get your
app running and to make it accessible to users across the world.
TIP When using Minikube, you can get the IP and port through which you
can access the service by running minikube service kubia-http.
If you look closely, you’ll see that the app is reporting the name of the pod as its host-
name. As already mentioned, each pod behaves like a separate independent machine
with its own IP address and hostname. Even though the application is running in
the worker node’s operating system, to the app it appears as though it’s running on
a separate machine dedicated to the app itself—no other processes are running
alongside it.
Figure 2.7 Your system at this point: the kubia-4jfyf pod (internal IP 10.1.0.1), managed by a ReplicationController and exposed through the kubia-http service.
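The ReplicationController behind your pod can be listed like this (a sketch of the command and its output; the age is illustrative):

$ kubectl get replicationcontrollers
NAME      DESIRED   CURRENT   READY     AGE
kubia     1         1         1         13m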
The list shows a single ReplicationController called kubia. The DESIRED column
shows the number of pod replicas you want the ReplicationController to keep,
whereas the CURRENT column shows the actual number of pods currently running. In
your case, you wanted to have a single replica of the pod running, and exactly one
replica is currently running.
INCREASING THE DESIRED REPLICA COUNT
To scale up the number of replicas of your pod, you need to change the desired
replica count on the ReplicationController like this:
$ kubectl scale rc kubia --replicas=3
replicationcontroller "kubia" scaled
You’ve now told Kubernetes to make sure three instances of your pod are always run-
ning. Notice that you didn’t instruct Kubernetes what action to take. You didn’t tell it
to add two more pods. You only set the new desired number of instances and let
Kubernetes determine what actions it needs to take to achieve the requested state.
This is one of the most fundamental Kubernetes principles. Instead of telling
Kubernetes exactly what actions it should perform, you’re only declaratively changing
the desired state of the system and letting Kubernetes examine the current actual
state and reconcile it with the desired state. This is true across all of Kubernetes.
SEEING THE RESULTS OF THE SCALE-OUT
Back to your replica count increase. Let’s list the ReplicationControllers again to see
the updated replica count:
$ kubectl get rc
NAME DESIRED CURRENT READY AGE
kubia 3 3 2 17m
Because the actual number of pods has already been increased to three (as evident
from the CURRENT column), listing all the pods should now show three pods instead
of one:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
kubia-hczji 1/1 Running 0 7s
kubia-iq9y6 0/1 Pending 0 7s
kubia-4jfyf 1/1 Running 0 18m
As you can see, three pods exist instead of one. Two are already running, one is still
pending, but should be ready in a few moments, as soon as the container image is
downloaded and the container is started.
As you can see, scaling an application is incredibly simple. Once your app is run-
ning in production and a need to scale the app arises, you can add additional
instances with a single command without having to install and run additional copies
manually.
Keep in mind that the app itself needs to support being scaled horizontally. Kuber-
netes doesn’t magically make your app scalable; it only makes it trivial to scale the app
up or down.
SEEING REQUESTS HIT ALL THREE PODS WHEN HITTING THE SERVICE
Because you now have multiple instances of your app running, let’s see what happens
if you hit the service URL again. Will you always hit the same app instance or not?
$ curl 104.155.74.57:8080
You’ve hit kubia-hczji
$ curl 104.155.74.57:8080
You’ve hit kubia-iq9y6
$ curl 104.155.74.57:8080
You’ve hit kubia-iq9y6
$ curl 104.155.74.57:8080
You’ve hit kubia-4jfyf
Requests are hitting different pods randomly. This is what services in Kubernetes do
when more than one pod instance backs them. They act as a load balancer standing in
front of multiple pods. When there’s only one pod, services provide a static address
for the single pod. Whether a service is backed by a single pod or a group of pods,
those pods come and go as they’re moved around the cluster, which means their IP
addresses change, but the service is always there at the same address. This makes it
easy for clients to connect to the pods, regardless of how many exist and how often
they change location.
VISUALIZING THE NEW STATE OF YOUR SYSTEM
Let’s visualize your system again to see what’s changed from before. Figure 2.8
shows the new state of your system. You still have a single service and a single
ReplicationController, but you now have three instances of your pod, all managed
by the ReplicationController. The service no longer sends all requests to a single
pod, but spreads them across all three pods as shown in the experiment with curl
in the previous section.
As an exercise, you can now try spinning up additional instances by increasing the
ReplicationController’s replica count even further and then scaling back down.
Figure 2.8 Three instances of a pod managed by the same ReplicationController and exposed through a single service IP and port (service kubia-http, internal IP 10.3.246.185, external IP 104.155.74.57, incoming requests on port 8080).
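The paragraph below refers to the output of kubectl describe for one of the pods; a representative invocation (the pod name is taken from the earlier listing) is:

$ kubectl describe pod kubia-hczji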
This shows, among other things, the node the pod has been scheduled to, the time
when it was started, the image(s) it’s running, and other useful information.
If you open the dashboard URL (shown in the kubectl cluster-info output) in a browser, you’re presented with a username and password prompt. You’ll find the username and password by running the following command:
$ gcloud container clusters describe kubia | grep -E "(username|password):"
password: 32nENgreEJ632A12
username: admin
These are the username and password for the dashboard. When using Minikube, you can open the dashboard in your browser by running the following command instead:
$ minikube dashboard
The dashboard will open in your default browser. Unlike with GKE, you won’t need to
enter any credentials to access it.
2.4 Summary
Hopefully, this initial hands-on chapter has shown you that Kubernetes isn’t a compli-
cated platform to use, and you’re ready to learn in depth about all the things it can
provide. After reading this chapter, you should now know how to
Pull and run any publicly available container image
Package your apps into container images and make them available to anyone by
pushing the images to a remote image registry
The previous chapter should have given you a rough picture of the basic compo-
nents you create in Kubernetes and at least an outline of what they do. Now, we’ll
start reviewing all types of Kubernetes objects (or resources) in greater detail, so
you’ll understand when, how, and why to use each of them. We’ll start with pods,
because they’re the central, most important, concept in Kubernetes. Everything
else either manages, exposes, or is used by pods.
Figure 3.1 All containers of a pod run on the same node. A pod never spans two nodes.
Therefore, you need to run each process in its own container. That’s how Docker
and Kubernetes are meant to be used.
NOTE When containers of the same pod use separate PID namespaces, you
only see the container’s own processes when running ps aux in the container.
But when it comes to the filesystem, things are a little different. Because most of the
container’s filesystem comes from the container image, by default, the filesystem of
each container is fully isolated from other containers. However, it’s possible to have
them share file directories using a Kubernetes concept called a Volume, which we’ll
talk about in chapter 6.
UNDERSTANDING HOW CONTAINERS SHARE THE SAME IP AND PORT SPACE
One thing to stress here is that because containers in a pod run in the same Network
namespace, they share the same IP address and port space. This means processes run-
ning in containers of the same pod need to take care not to bind to the same port
numbers or they’ll run into port conflicts. But this only concerns containers in the
same pod. Containers of different pods can never run into port conflicts, because
each pod has a separate port space. All the containers in a pod also have the same
loopback network interface, so a container can communicate with other containers in
the same pod through localhost.
Figure 3.2 Each pod gets a routable IP address on a flat network spanning all nodes, and all other pods see the pod under that IP address.
Figure 3.3 Pods should contain tightly coupled containers, usually a main container and containers that support the main one (for example, supporting containers sharing a volume with the main container).
For example, the main container in a pod could be a web server that serves files from
a certain file directory, while an additional container (a sidecar container) periodi-
cally downloads content from an external source and stores it in the web server’s
directory. In chapter 6 you’ll see that you need to use a Kubernetes Volume that you
mount into both containers.
Other examples of sidecar containers include log rotators and collectors, data pro-
cessors, communication adapters, and others.
DECIDING WHEN TO USE MULTIPLE CONTAINERS IN A POD
To recap how containers should be grouped into pods—when deciding whether to
put two containers into a single pod or into two separate pods, you always need to ask
yourself the following questions:
Do they need to be run together or can they run on different hosts?
Do they represent a single whole or are they independent components?
Must they be scaled together or individually?
Basically, you should always gravitate toward running containers in separate pods,
unless a specific reason requires them to be part of the same pod. Figure 3.4 will help
you memorize this.
Figure 3.4 A container shouldn’t run multiple processes. A pod shouldn’t contain multiple containers if they don’t need to run on the same machine (for example, a frontend process and a backend process belong in separate containers in separate pods).
Although pods can contain multiple containers, to keep things simple for now, you’ll
only be dealing with single-container pods in this chapter. You’ll see how multiple
containers are used in the same pod later, in chapter 6.
    terminationMessagePath: /dev/termination-log
    volumeMounts:
    - mountPath: /var/run/secrets/k8s.io/servacc
      name: default-token-kvcqa
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: gke-kubia-e8fe08b8-node-txje        # pod specification/contents
  restartPolicy: Always                         # (list of the pod's containers,
  serviceAccount: default                       # volumes, and so on)
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  volumes:
  - name: default-token-kvcqa
    secret:
      secretName: default-token-kvcqa
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: null
    status: "True"
    type: Ready
  containerStatuses:                            # detailed status of the pod
  - containerID: docker://f0276994322d247ba...  # and its containers
    image: luksa/kubia
    imageID: docker://4c325bcc6b40c110226b89fe...
    lastState: {}
    name: kubia
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2016-03-18T12:46:05Z
  hostIP: 10.132.0.4
  phase: Running
  podIP: 10.0.2.3
  startTime: 2016-03-18T12:44:32Z
I know this looks complicated, but it becomes simple once you understand the basics
and know how to distinguish between the important parts and the minor details. Also,
you can take comfort in the fact that when creating a new pod, the YAML you need to
write is much shorter, as you’ll see later.
INTRODUCING THE MAIN PARTS OF A POD DEFINITION
The pod definition consists of a few parts. First, there’s the Kubernetes API version
used in the YAML and the type of resource the YAML is describing. Then, three
important sections are found in almost all Kubernetes resources:
Metadata includes the name, namespace, labels, and other information about
the pod.
Spec contains the actual description of the pod’s contents, such as the pod’s con-
tainers, volumes, and other data.
Status contains the current information about the running pod, such as what
condition the pod is in, the description and status of each container, and the
pod’s internal IP and other basic info.
Listing 3.1 showed a full description of a running pod, including its status. The status
part contains read-only runtime data that shows the state of the resource at a given
moment. When creating a new pod, you never need to provide the status part.
The three parts described previously show the typical structure of a Kubernetes
API object. As you’ll see throughout the book, all other objects have the same anat-
omy. This makes understanding new objects relatively easy.
Going through all the individual properties in the previous YAML doesn’t make
much sense, so, instead, let’s see what the most basic YAML for creating a pod looks
like.
apiVersion: v1
kind: Pod
metadata:
  name: kubia-manual            # the name of the pod
spec:
  containers:
  - image: luksa/kubia          # container image to create the container from
    name: kubia                 # name of the container
    ports:
    - containerPort: 8080       # the port the app is listening on
      protocol: TCP
I’m sure you’ll agree this is much simpler than the definition in listing 3.1. Let’s exam-
ine this descriptor in detail. It conforms to the v1 version of the Kubernetes API. The
type of resource you’re describing is a pod, with the name kubia-manual. The pod
consists of a single container based on the luksa/kubia image. You’ve also given a
name to the container and indicated that it’s listening on port 8080.
SPECIFYING CONTAINER PORTS
Specifying ports in the pod definition is purely informational. Omitting them has no
effect on whether clients can connect to the pod through the port or not. If the con-
tainer is accepting connections through a port bound to the 0.0.0.0 address, other
pods can always connect to it, even if the port isn’t listed in the pod spec explicitly. But
it makes sense to define the ports explicitly so that everyone using your cluster can
quickly see what ports each pod exposes. Explicitly defining ports also allows you to
assign a name to each port, which can come in handy, as you’ll see later in the book.
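The field listing that follows is the output of kubectl’s built-in documentation command, shown here as a sketch (the full output also includes a DESCRIPTION section):

$ kubectl explain pods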
FIELDS:
kind <string>
Kind is a string value representing the REST resource this object
represents...
metadata <Object>
Standard object's metadata...
spec <Object>
Specification of the desired behavior of the pod...
status <Object>
Most recently observed status of the pod. This data may not be up to
date...
Kubectl prints out the explanation of the object and lists the attributes the object
can contain. You can then drill deeper to find out more about each attribute. For
example, you can examine the spec attribute like this:
$ kubectl explain pod.spec
RESOURCE: spec <Object>
DESCRIPTION:
Specification of the desired behavior of the pod...
podSpec is a description of a pod.
FIELDS:
hostPID <boolean>
Use the host's pid namespace. Optional: Default to false.
...
volumes <[]Object>
List of volumes that can be mounted by containers belonging to the
pod.
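Creating the pod from the descriptor looks like this (a sketch; the file name kubia-manual.yaml is an assumption matching the pod’s name):

$ kubectl create -f kubia-manual.yaml
pod "kubia-manual" created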
The kubectl create -f command is used for creating any resource (not only pods)
from a YAML or JSON file.
RETRIEVING THE WHOLE DEFINITION OF A RUNNING POD
After creating the pod, you can ask Kubernetes for the full YAML of the pod. You’ll
see it’s similar to the YAML you saw earlier. You’ll learn about the additional fields
appearing in the returned definition in the next sections. Go ahead and use the fol-
lowing command to see the full descriptor of the pod:
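A sketch of the command:

$ kubectl get po kubia-manual -o yaml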
If you’re more into JSON, you can also tell kubectl to return JSON instead of YAML
like this (this works even if you used YAML to create the pod):
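$ kubectl get po kubia-manual -o json

You can also check on the pod by listing pods (a sketch; the age is illustrative):

$ kubectl get pods
NAME           READY     STATUS    RESTARTS   AGE
kubia-manual   1/1       Running   0          32s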
There’s your kubia-manual pod. Its status shows that it’s running. If you’re like me,
you’ll probably want to confirm that’s true by talking to the pod. You’ll do that in a
minute. First, you’ll look at the app’s log to check for any errors.
Containerized applications usually write their logs to the standard output and standard error streams instead of writing them to files. This is to allow users to view logs of different applications in a simple, standard way.
The container runtime (Docker in your case) redirects those streams to files and
allows you to get the container’s log by running
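This is Docker’s own log command, run on the node (a sketch; the container id is a placeholder):

$ docker logs <container id>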
You could use ssh to log into the node where your pod is running and retrieve its logs
with docker logs, but Kubernetes provides an easier way.
RETRIEVING A POD’S LOG WITH KUBECTL LOGS
To see your pod’s log (more precisely, the container’s log) you run the following com-
mand on your local machine (no need to ssh anywhere):
$ kubectl logs kubia-manual
Kubia server starting...
You haven’t sent any web requests to your Node.js app, so the log only shows a single
log statement about the server starting up. As you can see, retrieving logs of an appli-
cation running in Kubernetes is incredibly simple if the pod only contains a single
container.
NOTE Container logs are automatically rotated daily and every time the log file
reaches 10MB in size. The kubectl logs command only shows the log entries
from the last rotation.
SPECIFYING THE CONTAINER NAME WHEN GETTING LOGS OF A MULTI-CONTAINER POD
If your pod includes multiple containers, you have to explicitly specify the container
name by including the -c <container name> option when running kubectl logs. In
your kubia-manual pod, you set the container’s name to kubia, so if additional con-
tainers exist in the pod, you’d have to get its logs like this:
$ kubectl logs kubia-manual -c kubia
Kubia server starting...
Note that you can only retrieve container logs of pods that are still in existence. When
a pod is deleted, its logs are also deleted. To make a pod’s logs available even after the
pod is deleted, you need to set up centralized, cluster-wide logging, which stores all
the logs into a central store. Chapter 17 explains how centralized logging works.
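Another way to talk to a specific pod is to forward a local port to a port of the pod with kubectl port-forward. A sketch matching the ports discussed below (local port 8888 forwarded to the pod’s port 8080):

$ kubectl port-forward kubia-manual 8888:8080
... Forwarding from 127.0.0.1:8888 -> 8080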
The port forwarder is running and you can now connect to your pod through the
local port.
CONNECTING TO THE POD THROUGH THE PORT FORWARDER
In a different terminal, you can now use curl to send an HTTP request to your pod
through the kubectl port-forward proxy running on localhost:8888:
$ curl localhost:8888
You’ve hit kubia-manual
Figure 3.5 shows an overly simplified view of what happens when you send the request.
In reality, a couple of additional components sit between the kubectl process and the
pod, but they aren’t relevant right now.
Figure 3.5 A simplified view of what happens when you use curl with kubectl port-forward: curl connects to port 8888 of the local kubectl port-forward process, which forwards the connection to port 8080 of the kubia-manual pod.
Using port forwarding like this is an effective way to test an individual pod. You’ll
learn about other similar methods throughout the book.
In real-world systems composed of many microservices, each microservice may run in multiple instances (multiple copies of the same component will be deployed) and multiple versions or releases (stable, beta, canary, and so on) will run concurrently. This can lead to hundreds of pods in the system. Without a mechanism for organizing them, you end up with a big, incomprehensible mess, such as the one shown in figure 3.6. The figure shows pods of multiple microservices, with several running multiple replicas, and others running different releases of the same microservice.
Figure 3.6 Uncategorized pods of multiple microservices (UI, Account Service, Product Catalog, Shopping Cart, Order Service), some running multiple replicas and others running different releases of the same microservice.
It’s evident you need a way of organizing them into smaller groups based on arbitrary
criteria, so every developer and system administrator dealing with your system can eas-
ily see which pod is which. And you’ll want to operate on every pod belonging to a cer-
tain group with a single action instead of having to perform the action for each pod
individually.
Organizing pods and all other Kubernetes objects is done through labels.
Let’s turn back to the microservices example from figure 3.6. By adding labels to
those pods, you get a much-better-organized system that everyone can easily make
sense of. Each pod is labeled with two labels:
app, which specifies which app, component, or microservice the pod belongs to.
rel, which shows whether the application running in the pod is a stable, beta,
or a canary release.
By adding these two labels, you’ve essentially organized your pods into two dimen-
sions (horizontally by app and vertically by release), as shown in figure 3.7.
Every developer or ops person with access to your cluster can now easily see the sys-
tem’s structure and where each pod fits in by looking at the pod’s labels.
apiVersion: v1
kind: Pod
metadata:
  name: kubia-manual-v2
  labels:
    creation_method: manual     # two labels are
    env: prod                   # attached to the pod
spec:
  containers:
  - image: luksa/kubia
    name: kubia
    ports:
    - containerPort: 8080
      protocol: TCP
The kubectl get pods command doesn’t list any labels by default, but you can see
them by using the --show-labels switch:
$ kubectl get po --show-labels
NAME READY STATUS RESTARTS AGE LABELS
kubia-manual 1/1 Running 0 16m <none>
kubia-manual-v2 1/1 Running 0 2m creation_method=manual,env=prod
kubia-zxzij 1/1 Running 0 1d run=kubia
Instead of listing all labels, if you’re only interested in certain labels, you can specify
them with the -L switch and have each displayed in its own column. List pods again
and show the columns for the two labels you’ve attached to your kubia-manual-v2 pod:
$ kubectl get po -L creation_method,env
NAME READY STATUS RESTARTS AGE CREATION_METHOD ENV
kubia-manual 1/1 Running 0 16m <none> <none>
kubia-manual-v2 1/1 Running 0 2m manual prod
kubia-zxzij 1/1 Running 0 1d <none> <none>
Now, let’s also change the env=prod label to env=debug on the kubia-manual-v2 pod,
to see how existing labels can be changed.
NOTE You need to use the --overwrite option when changing existing labels.
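A sketch of the kubectl label commands, first adding a label to the kubia-manual pod and then changing env=prod to env=debug on kubia-manual-v2 (note the --overwrite flag mentioned in the NOTE):

$ kubectl label po kubia-manual creation_method=manual
pod "kubia-manual" labeled
$ kubectl label po kubia-manual-v2 env=debug --overwrite
pod "kubia-manual-v2" labeled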
As you can see, attaching labels to resources is trivial, and so is changing them on
existing resources. It may not be evident right now, but this is an incredibly powerful
feature, as you’ll see in the next chapter. But first, let’s see what you can do with these
labels, in addition to displaying them when listing pods.
To list all pods that include the env label, whatever its value is:
$ kubectl get po -l env
NAME READY STATUS RESTARTS AGE
kubia-manual-v2 1/1 Running 0 37m
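And to list the pods that don’t have the env label (the single quotes are needed, as the NOTE below explains; ages are illustrative):

$ kubectl get po -l '!env'
NAME           READY     STATUS    RESTARTS   AGE
kubia-manual   1/1       Running   0          47m
kubia-zxzij    1/1       Running   0          1d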
NOTE Make sure to use single quotes around !env, so the bash shell doesn’t
evaluate the exclamation mark.
Similarly, you could also match pods with the following label selectors:
creation_method!=manual to select pods with the creation_method label with
any value other than manual
env in (prod,devel) to select pods with the env label set to either prod or
devel
env notin (prod,devel) to select pods with the env label set to any value other
than prod or devel
Figure 3.8 Selecting the product catalog microservice pods using the “app=pc” label selector
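Labels work on nodes the same way they do on pods. Before a node can be selected by the gpu=true label, it needs to be labeled; a sketch using the node name from the listing below:

$ kubectl label node gke-kubia-85f6-node-0rrx gpu=true
node "gke-kubia-85f6-node-0rrx" labeled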
Now you can use a label selector when listing the nodes, like you did before with pods.
List only nodes that include the label gpu=true:
$ kubectl get nodes -l gpu=true
NAME STATUS AGE
gke-kubia-85f6-node-0rrx Ready 1d
As expected, only one node has this label. You can also try listing all the nodes and tell
kubectl to display an additional column showing the values of each node’s gpu label
(kubectl get nodes -L gpu).
Listing 3.4 Using a label selector to schedule a pod to a specific node: kubia-gpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kubia-gpu
spec:
  nodeSelector:                 # nodeSelector tells Kubernetes to deploy this pod
    gpu: "true"                 # only to nodes containing the gpu=true label
  containers:
  - image: luksa/kubia
    name: kubia
You’ve added a nodeSelector field under the spec section. When you create the pod,
the scheduler will only choose among the nodes that contain the gpu=true label
(which is only a single node in your case).
To see an example of annotations that Kubernetes adds to objects automatically, you can request the full YAML of the pod or use the kubectl describe command. You’ll use the first option in the following listing.
Without going into too many details, as you can see, the kubernetes.io/created-by
annotation holds JSON data about the object that created the pod. That’s not some-
thing you’d want to put into a label. Labels should be short, whereas annotations can
contain relatively large blobs of data (up to 256 KB in total).
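The annotation discussed in the next paragraph can be added with kubectl annotate (a sketch):

$ kubectl annotate pod kubia-manual mycompany.com/someannotation="foo bar"
pod "kubia-manual" annotated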
You added the annotation mycompany.com/someannotation with the value foo bar.
It’s a good idea to use this format for annotation keys to prevent key collisions. When
different tools or libraries add annotations to objects, they may accidentally override
each other’s annotations if they don’t use unique prefixes like you did here.
You can use kubectl describe to see the annotation you added:
$ kubectl describe pod kubia-manual
...
Annotations: mycompany.com/someannotation=foo bar
...
But what about times when you want to split objects into separate, non-overlapping
groups? You may want to only operate inside one group at a time. For this and other
reasons, Kubernetes also groups objects into namespaces. These aren’t the Linux
namespaces we talked about in chapter 2, which are used to isolate processes from
each other. Kubernetes namespaces provide a scope for object names. Instead of hav-
ing all your resources in one single namespace, you can split them into multiple name-
spaces, which also allows you to use the same resource names multiple times (across
different namespaces).
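You can list all the namespaces in your cluster like this (a sketch; ages are illustrative):

$ kubectl get ns
NAME            STATUS    AGE
default         Active    1h
kube-public     Active    1h
kube-system     Active    1h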
Up to this point, you’ve operated only in the default namespace. When listing resources
with the kubectl get command, you’ve never specified the namespace explicitly, so
kubectl always defaulted to the default namespace, showing you only the objects in
that namespace. But as you can see from the list, the kube-public and the kube-system
namespaces also exist. Let’s look at the pods that belong to the kube-system name-
space, by telling kubectl to list pods in that namespace only:
$ kubectl get po --namespace kube-system
NAME READY STATUS RESTARTS AGE
fluentd-cloud-kubia-e8fe-node-txje 1/1 Running 0 1h
heapster-v11-fz1ge 1/1 Running 0 1h
kube-dns-v9-p8a4t 0/4 Pending 0 1h
kube-ui-v4-kdlai 1/1 Running 0 1h
l7-lb-controller-v0.5.2-bue96 2/2 Running 92 1h
You’ll learn about these pods later in the book (don’t worry if the pods shown here
don’t match the ones on your system exactly). It’s clear from the name of the name-
space that these are resources related to the Kubernetes system itself. By having
them in this separate namespace, it keeps everything nicely organized. If they were
all in the default namespace, mixed in with the resources you create yourself, you’d
have a hard time seeing what belongs where, and you might inadvertently delete sys-
tem resources.
Namespaces enable you to separate resources that don’t belong together into non-
overlapping groups. If several users or groups of users are using the same Kubernetes
cluster, and they each manage their own distinct set of resources, they should each use
their own namespace. This way, they don’t need to take any special care not to inad-
vertently modify or delete the other users’ resources and don’t need to concern them-
selves with name conflicts, because namespaces provide a scope for resource names,
as has already been mentioned.
Besides isolating resources, namespaces are also used for allowing only certain users
access to particular resources and even for limiting the amount of computational
resources available to individual users. You’ll learn about this in chapters 12 through 14.
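The custom-namespace.yaml file referenced in the next step defines a Namespace object; a minimal sketch looks like this:

apiVersion: v1
kind: Namespace
metadata:
  name: custom-namespace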
Now, use kubectl to post the file to the Kubernetes API server:
$ kubectl create -f custom-namespace.yaml
namespace "custom-namespace" created
This shows that everything in Kubernetes has a corresponding API object that you can create, read, update, and delete by posting a YAML manifest to the API server.
You could have created the namespace like this:
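A sketch of the dedicated command:

$ kubectl create namespace custom-namespace
namespace "custom-namespace" created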
NOTE Although most objects’ names must conform to the naming conven-
tions specified in RFC 1035 (Domain names), which means they may contain
only letters, digits, dashes, and dots, namespaces (and a few others) aren’t
allowed to contain dots.
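To create the kubia-manual pod in the new namespace, you pass the namespace to kubectl create (a sketch; the manifest file name is an assumption matching the pod’s name):

$ kubectl create -f kubia-manual.yaml -n custom-namespace
pod "kubia-manual" created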
You now have two pods with the same name (kubia-manual). One is in the default
namespace, and the other is in your custom-namespace.
When listing, describing, modifying, or deleting objects in other namespaces, you
need to pass the --namespace (or -n) flag to kubectl. If you don’t specify the name-
space, kubectl performs the action in the default namespace configured in the cur-
rent kubectl context. The current context’s namespace and the current context itself
can be changed through kubectl config commands. To learn more about managing
kubectl contexts, refer to appendix A.
TIP To quickly switch to a different namespace, you can set up the following
alias: alias kcd='kubectl config set-context $(kubectl config current-
context) --namespace '. You can then switch between namespaces using kcd
some-namespace.
Although namespaces split objects into distinct groups, they don’t provide any network isolation between running pods by themselves: if a pod in namespace foo knows the IP address of a pod in namespace bar, there is nothing preventing it from sending traffic, such as HTTP requests, to the other pod.
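Deleting a pod by name looks like this (a sketch using the kubia-gpu pod created earlier):

$ kubectl delete po kubia-gpu
pod "kubia-gpu" deleted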
By deleting a pod, you’re instructing Kubernetes to terminate all the containers that are
part of that pod. Kubernetes sends a SIGTERM signal to the process and waits a certain
number of seconds (30 by default) for it to shut down gracefully. If it doesn’t shut down
in time, the process is then killed through SIGKILL. To make sure your processes are
always shut down gracefully, they need to handle the SIGTERM signal properly.
TIP You can also delete more than one pod by specifying multiple, space-sep-
arated names (for example, kubectl delete po pod1 pod2).
In the earlier microservices example, where you had tens (or possibly hundreds) of
pods, you could, for instance, delete all canary pods at once by specifying the
rel=canary label selector (visualized in figure 3.10):
Figure 3.10 Selecting and deleting all canary pods through the rel=canary label selector (the stable and beta pods of each microservice are left untouched).
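A sketch of the command:

$ kubectl delete po -l rel=canary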
You can also get rid of the pods in custom-namespace by deleting the whole namespace (the pods will be deleted along with the namespace auto-
matically), using the following command:
$ kubectl delete ns custom-namespace
namespace "custom-namespace" deleted
This time, instead of deleting the specific pod, tell Kubernetes to delete all pods in the
current namespace by using the --all option:
$ kubectl delete po --all
pod "kubia-zxzij" deleted
Wait, what!?! The kubia-zxzij pod is terminating, but a new pod called kubia-09as0,
which wasn’t there before, has appeared. No matter how many times you delete all
pods, a new pod called kubia-something will emerge.
You may remember you created your first pod with the kubectl run command. In
chapter 2, I mentioned that this doesn’t create a pod directly, but instead creates a
ReplicationController, which then creates the pod. As soon as you delete a pod cre-
ated by the ReplicationController, it immediately creates a new one. To delete the
pod, you also need to delete the ReplicationController.
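A sketch of the command that deletes (almost) all resources in the current namespace:

$ kubectl delete all --all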
The first all in the command specifies that you’re deleting resources of all types, and
the --all option specifies that you’re deleting all resource instances instead of speci-
fying them by name (you already used this option when you ran the previous delete
command).
NOTE Deleting everything with the all keyword doesn’t delete absolutely
everything. Certain resources (like Secrets, which we’ll introduce in chapter 7)
are preserved and need to be deleted explicitly.
As it deletes resources, kubectl will print the name of every resource it deletes. In the
list, you should see the kubia ReplicationController and the kubia-http Service you
created in chapter 2.
NOTE The kubectl delete all --all command also deletes the kubernetes
Service, but it should be recreated automatically in a few moments.
3.9 Summary
After reading this chapter, you should now have a decent knowledge of the central
building block in Kubernetes. Every other concept you’ll learn about in the next few
chapters is directly related to pods.
In this chapter, you’ve learned
How to decide whether certain containers should be grouped together in a pod
or not.
Pods can run multiple processes and are similar to physical hosts in the non-
container world.
YAML or JSON descriptors can be written and used to create pods and then
examined to see the specification of a pod and its current state.
Labels and label selectors should be used to organize pods and easily perform
operations on multiple pods at once.
You can use node labels and selectors to schedule pods only to nodes that have
certain features.
Annotations allow attaching larger blobs of data to pods either by people or
tools and libraries.
Namespaces can be used to allow different teams to use the same cluster as
though they were using separate Kubernetes clusters.
How to use the kubectl explain command to quickly look up the information
on any Kubernetes resource.
In the next chapter, you’ll learn about ReplicationControllers and other resources
that manage pods.
Replication and other
controllers: deploying
managed pods
As you’ve learned so far, pods represent the basic deployable unit in Kubernetes.
You know how to create, supervise, and manage them manually. But in real-world
use cases, you want your deployments to stay up and running automatically and
remain healthy without any manual intervention. To do this, you almost never cre-
ate pods directly. Instead, you create other types of resources, such as Replication-
Controllers or Deployments, which then create and manage the actual pods.
When you create unmanaged pods (such as the ones you created in the previ-
ous chapter), a cluster node is selected to run the pod and then its containers are
run on that node. In this chapter, you’ll learn that Kubernetes then monitors
those containers and automatically restarts them if they fail. But if the whole node
fails, the pods on the node are lost and will not be replaced with new ones, unless
those pods are managed by the previously mentioned ReplicationControllers or simi-
lar. In this chapter, you’ll learn how Kubernetes checks if a container is still alive and
restarts it if it isn’t. You’ll also learn how to run managed pods—both those that run
indefinitely and those that perform a single task and then stop.
NOTE Kubernetes also supports readiness probes, which we’ll learn about in the
next chapter. Be sure not to confuse the two. They’re used for two different
things.
An HTTP GET probe performs an HTTP GET request on the container’s IP address, a port, and a path you specify. If the probe receives a response and the response code doesn’t represent an error (in other words, if the HTTP response code is 2xx or 3xx), the probe is considered successful. If the server returns an error response code or if it doesn’t respond at all, the probe is considered a failure and the container will be restarted as a result.
A TCP Socket probe tries to open a TCP connection to the specified port of the
container. If the connection is established successfully, the probe is successful.
Otherwise, the container is restarted.
An Exec probe executes an arbitrary command inside the container and checks
the command’s exit status code. If the status code is 0, the probe is successful.
All other codes are considered failures.
apiVersion: v1
kind: Pod
metadata:
  name: kubia-liveness
spec:
  containers:
  - image: luksa/kubia-unhealthy    # the image containing the (somewhat) broken app
    name: kubia
    livenessProbe:                  # a liveness probe that will perform an HTTP GET
      httpGet:
        path: /                     # the path to request in the HTTP request
        port: 8080                  # the network port the probe should connect to
The pod descriptor defines an httpGet liveness probe, which tells Kubernetes to peri-
odically perform HTTP GET requests on path / on port 8080 to determine if the con-
tainer is still healthy. These requests start as soon as the container is run.
After five such requests (or actual client requests), your app starts returning
HTTP status code 500, which Kubernetes will treat as a probe failure, and will thus
restart the container.
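A sketch of what listing the pod looks like after the first restart (the age is illustrative):

$ kubectl get po kubia-liveness
NAME             READY     STATUS    RESTARTS   AGE
kubia-liveness   1/1       Running   1          2m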
The RESTARTS column shows that the pod’s container has been restarted once (if you
wait another minute and a half, it gets restarted again, and then the cycle continues
indefinitely).
When you want to figure out why the previous container terminated, you’ll want to
see those logs instead of the current container’s logs. This can be done by using
the --previous option:
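A sketch of the command:

$ kubectl logs kubia-liveness --previous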
You can see why the container had to be restarted by looking at what kubectl describe
prints out, as shown in the following listing.
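An abridged sketch of the relevant parts of the output (the values match those discussed in the following paragraphs):

$ kubectl describe po kubia-liveness
...
    State:          Running
    Last State:     Terminated
      Exit Code:    137
    Liveness:       http-get http://:8080/ delay=0s timeout=1s period=10s #success=1 #failure=3
...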
You can see that the container is currently running, but it previously terminated
because of an error. The exit code was 137, which has a special meaning—it denotes
that the process was terminated by an external signal. The number 137 is a sum of two
numbers: 128+x, where x is the signal number sent to the process that caused it to ter-
minate. In the example, x equals 9, which is the number of the SIGKILL signal, mean-
ing the process was killed forcibly.
The events listed at the bottom show why the container was killed—Kubernetes
detected the container was unhealthy, so it killed and re-created it.
Beside the liveness probe options you specified explicitly, you can also see additional
properties, such as delay, timeout, period, and so on. The delay=0s part shows that
the probing begins immediately after the container is started. The timeout is set to
only 1 second, so the container must return a response in 1 second or the probe is
counted as failed. The container is probed every 10 seconds (period=10s) and the
container is restarted after the probe fails three consecutive times (#failure=3).
These additional parameters can be customized when defining the probe. For
example, to set the initial delay, add the initialDelaySeconds property to the live-
ness probe as shown in the following listing.
livenessProbe:
  httpGet:
    path: /
    port: 8080
  initialDelaySeconds: 15    # Kubernetes will wait 15 seconds before executing the first probe
If you don’t set the initial delay, the prober will start probing the container as soon as
it starts, which usually leads to the probe failing, because the app isn’t ready to start
receiving requests. If the number of failures exceeds the failure threshold, the con-
tainer is restarted before it’s even able to start responding to requests properly.
TIP Always remember to set an initial delay to account for your app’s startup
time.
I’ve seen this on many occasions and users were confused why their container was
being restarted. But if they’d used kubectl describe, they’d have seen that the con-
tainer terminated with exit code 137 or 143, telling them that the pod was terminated
externally. Additionally, the listing of the pod’s events would show that the container
was killed because of a failed liveness probe. If you see this happening at pod startup,
it’s because you failed to set initialDelaySeconds appropriately.
NOTE Exit code 137 signals that the process was killed by an external signal (the exit code is 128 + 9, where 9 is the number of the SIGKILL signal). Likewise, exit code 143 corresponds to 128 + 15 (SIGTERM).
TIP Make sure the /health HTTP endpoint doesn’t require authentication;
otherwise the probe will always fail, causing your container to be restarted
indefinitely.
Be sure to check only the internals of the app and nothing influenced by an external
factor. For example, a frontend web server’s liveness probe shouldn’t return a failure
when the server can’t connect to the backend database. If the underlying cause is in
the database itself, restarting the web server container will not fix the problem.
Because the liveness probe will fail again, you’ll end up with the container restarting
repeatedly until the database becomes accessible again.
KEEPING PROBES LIGHT
Liveness probes shouldn’t use too many computational resources and shouldn’t take
too long to complete. By default, the probes are executed relatively often and are
only allowed one second to complete. Having a probe that does heavy lifting can slow
down your container considerably. Later in the book, you’ll also learn about how to
limit CPU time available to a container. The probe’s CPU time is counted in the con-
tainer’s CPU time quota, so having a heavyweight liveness probe will reduce the CPU
time available to the main application processes.
TIP If you’re running a Java app in your container, be sure to use an HTTP
GET liveness probe instead of an Exec probe, where you spin up a whole new
JVM to get the liveness information. The same goes for any JVM-based or sim-
ilar applications, whose start-up procedure requires considerable computa-
tional resources.
DON’T BOTHER IMPLEMENTING RETRY LOOPS IN YOUR PROBES
You’ve already seen that the failure threshold for the probe is configurable and usu-
ally the probe must fail multiple times before the container is killed. But even if you
set the failure threshold to 1, Kubernetes will retry the probe several times before con-
sidering it a single failed attempt. Therefore, implementing your own retry loop into
the probe is wasted effort.
LIVENESS PROBE WRAP-UP
You now understand that Kubernetes keeps your containers running by restarting
them if they crash or if their liveness probes fail. This job is performed by the Kubelet
on the node hosting the pod—the Kubernetes Control Plane components running on
the master(s) have no part in this process.
But if the node itself crashes, it’s the Control Plane that must create replacements for
all the pods that went down with the node. It doesn’t do that for pods that you create
directly. Those pods aren’t managed by anything except by the Kubelet, but because the
Kubelet runs on the node itself, it can’t do anything if the node fails.
To make sure your app is restarted on another node, you need to have the pod
managed by a ReplicationController or similar mechanism, which we’ll discuss in the
rest of this chapter.
Figure 4.1 When a node fails, only pods backed by a ReplicationController are recreated: the ReplicationController notices a pod is missing and creates a new pod instance on another node, while unmanaged pods running on the failed node are lost.
I’ve used the term pod “type” a few times. But no such thing exists. Replication-
Controllers don’t operate on pod types, but on sets of pods that match a certain label
selector (you learned about them in the previous chapter).
INTRODUCING THE CONTROLLER’S RECONCILIATION LOOP
A ReplicationController’s job is to make sure that an exact number of pods always
matches its label selector. If it doesn’t, the ReplicationController takes the appropriate
action to reconcile the actual with the desired number. The operation of a Replication-
Controller is shown in figure 4.2.
Figure 4.2 A ReplicationController’s reconciliation loop: find pods matching the label selector, compare the matched number with the desired replica count, and create or delete pods as needed (if there are just enough, do nothing).
A ReplicationController’s replica count, the label selector, and even the pod tem-
plate can all be modified at any time, but only changes to the replica count affect
existing pods.
UNDERSTANDING THE EFFECT OF CHANGING THE CONTROLLER’S LABEL SELECTOR OR POD TEMPLATE
Changes to the label selector and the pod template have no effect on existing pods.
Changing the label selector makes the existing pods fall out of the scope of the
ReplicationController, so the controller stops caring about them. ReplicationCon-
trollers also don’t care about the actual “contents” of its pods (the container images,
environment variables, and other things) after they create the pod. The template
therefore only affects new pods created by this ReplicationController. You can think
of it as a cookie cutter for cutting out new pods.
UNDERSTANDING THE BENEFITS OF USING A REPLICATIONCONTROLLER
Like many things in Kubernetes, a ReplicationController, although an incredibly sim-
ple concept, provides or enables the following powerful features:
It makes sure a pod (or multiple pod replicas) is always running by starting a
new pod when an existing one goes missing.
When a cluster node fails, it creates replacement replicas for all the pods that
were running on the failed node (those that were under the Replication-
Controller’s control).
It enables easy horizontal scaling of pods—both manual and automatic (see
horizontal pod auto-scaling in chapter 15).
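The pod template below is the tail of the ReplicationController manifest (kubia-rc.yaml). The part that precedes it isn’t reproduced above; a sketch reconstructed from the surrounding text, which states the controller is named kubia, keeps three replicas, and selects pods with the app=kubia label, looks like this:

apiVersion: v1
kind: ReplicationController
metadata:
  name: kubia
spec:
  replicas: 3
  selector:
    app: kubia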
  template:                     # the pod template for creating new pods
    metadata:
      labels:
        app: kubia
    spec:
      containers:
      - name: kubia
        image: luksa/kubia
        ports:
        - containerPort: 8080
When you post the file to the API server, Kubernetes creates a new Replication-
Controller named kubia, which makes sure three pod instances always match the
label selector app=kubia. When there aren’t enough pods, new pods will be created
from the provided pod template. The contents of the template are almost identical to
the pod definition you created in the previous chapter.
The pod labels in the template must obviously match the label selector of the
ReplicationController; otherwise the controller would create new pods indefinitely,
because spinning up a new pod wouldn’t bring the actual replica count any closer to
the desired number of replicas. To prevent such scenarios, the API server verifies the
ReplicationController definition and will not accept it if it’s misconfigured.
Not specifying the selector at all is also an option. In that case, it will be configured
automatically from the labels in the pod template.
To create the ReplicationController, use the kubectl create command, which you
already know:
$ kubectl create -f kubia-rc.yaml
replicationcontroller "kubia" created
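Let’s check whether the ReplicationController has done its job and created the three pods (a sketch; the pod names are the generated ones referenced later in this section, and the ages are illustrative):

$ kubectl get pods
NAME          READY     STATUS    RESTARTS   AGE
kubia-53thy   1/1       Running   0          1m
kubia-k0xz6   1/1       Running   0          1m
kubia-q3vkg   1/1       Running   0          1m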
Indeed, it has! You wanted three pods, and it created three pods. It’s now managing
those three pods. Next you’ll mess with them a little to see how the Replication-
Controller responds.
SEEING THE REPLICATIONCONTROLLER RESPOND TO A DELETED POD
First, you’ll delete one of the pods manually to see how the ReplicationController spins
up a new one immediately, bringing the number of matching pods back to three:
$ kubectl delete pod kubia-53thy
pod "kubia-53thy" deleted
Listing the pods again shows four of them, because the one you deleted is terminat-
ing, and a new pod has already been created:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
kubia-53thy 1/1 Terminating 0 3m
kubia-oini2 0/1 ContainerCreating 0 2s
kubia-k0xz6 1/1 Running 0 3m
kubia-q3vkg 1/1 Running 0 3m
The ReplicationController has done its job again. It’s a nice little helper, isn’t it?
GETTING INFORMATION ABOUT A REPLICATIONCONTROLLER
Now, let’s see what information the kubectl get command shows for Replication-
Controllers:
$ kubectl get rc
NAME DESIRED CURRENT READY AGE
kubia 3 3 2 3m
You see three columns showing the desired number of pods, the actual number of
pods, and how many of them are ready (you’ll learn what that means in the next chap-
ter, when we talk about readiness probes).
You can see additional information about your ReplicationController with the
kubectl describe command, as shown in the following listing.
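The listing itself isn’t reproduced here; a trimmed sketch of what kubectl describe rc kubia would print (exact formatting and event messages vary with the Kubernetes version):

$ kubectl describe rc kubia
Name:           kubia
Namespace:      default
Selector:       app=kubia
Labels:         app=kubia
Replicas:       3 current / 3 desired
Pods Status:    4 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       app=kubia
  Containers:   ...
Events:
  ... SuccessfulCreate   Created pod: kubia-53thy
  ... SuccessfulCreate   Created pod: kubia-k0xz6
  ... SuccessfulCreate   Created pod: kubia-q3vkg
  ... SuccessfulCreate   Created pod: kubia-oini2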
The current number of replicas matches the desired number, because the controller
has already created a new pod. It shows four running pods because a pod that’s termi-
nating is still considered running, although it isn’t counted in the current replica count.
The list of events at the bottom shows the actions taken by the Replication-
Controller—it has created four pods so far.
UNDERSTANDING EXACTLY WHAT CAUSED THE CONTROLLER TO CREATE A NEW POD
The controller is responding to the deletion of a pod by creating a new replacement
pod (see figure 4.4). Well, technically, it isn’t responding to the deletion itself, but the
resulting state—the inadequate number of pods.
While a ReplicationController is immediately notified about a pod being deleted
(the API server allows clients to watch for changes to resources and resource lists), that’s
not what causes it to create a replacement pod. The notification triggers the controller
to check the actual number of pods and take appropriate action.
Figure 4.4 If a pod disappears, the ReplicationController sees too few pods and creates a new replacement pod.
The next exercise shows what the ReplicationController does when an entire cluster node fails.

NOTE If you’re using Minikube, you can’t do this exercise, because you only
have one node that acts both as a master and a worker node.
If a node fails in the non-Kubernetes world, the ops team would need to migrate the
applications running on that node to other machines manually. Kubernetes, on the
other hand, does that automatically. Soon after the ReplicationController detects that
its pods are down, it will spin up new pods to replace them.
Let’s see this in action. You need to ssh into one of the nodes with the gcloud
compute ssh command and then shut down its network interface with sudo ifconfig
eth0 down, as shown in the following listing.
NOTE Choose a node that runs at least one of your pods by listing pods with
the -o wide option.
Listing 4.6 Simulating a node failure by shutting down its network interface
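The listing body is missing here; the commands it describes would look roughly like this (the node name is just an example, taken from the listing that follows):

$ gcloud compute ssh gke-kubia-default-pool-b46381f1-zwko
...

$ sudo ifconfig eth0 down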
When you shut down the network interface, the ssh session will stop responding, so
you need to open up another terminal or hard-exit from the ssh session. In the new
terminal you can list the nodes to see if Kubernetes has detected that the node is
down. This takes a minute or so. Then, the node’s status is shown as NotReady:
$ kubectl get node
NAME                                   STATUS     AGE
gke-kubia-default-pool-b46381f1-opc5   Ready      5h
gke-kubia-default-pool-b46381f1-s8gj   Ready      5h
gke-kubia-default-pool-b46381f1-zwko   NotReady   5h    <-- node isn't ready, because it's disconnected from the network
If you list the pods now, you’ll still see the same three pods as before, because Kuber-
netes waits a while before rescheduling pods (in case the node is unreachable because
of a temporary network glitch or because the Kubelet is restarting). If the node stays
unreachable for several minutes, the status of the pods that were scheduled to that
node changes to Unknown. At that point, the ReplicationController will immediately
spin up a new pod. You can see this by listing the pods again:
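A sketch of what that might look like (pod names and ages are illustrative; the pod from the failed node shows up as Unknown while its replacement is already running):

$ kubectl get pods
NAME          READY     STATUS    RESTARTS   AGE
kubia-oini2   1/1       Running   0          30m
kubia-k0xz6   1/1       Running   0          30m
kubia-q3vkg   1/1       Unknown   0          30m
kubia-dmdck   1/1       Running   0          5s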
Looking at the age of the pods, you see that the kubia-dmdck pod is new. You again
have three pod instances running, which means the ReplicationController has again
done its job of bringing the actual state of the system to the desired state.
The same thing happens if a node fails (either breaks down or becomes unreach-
able). No immediate human intervention is necessary. The system heals itself
automatically.
To bring the node back, you need to reset it with the following command:
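The command isn’t shown in the text; on GKE it would be something along these lines (a sketch, reusing the node name from the earlier listing):

$ gcloud compute instances reset gke-kubia-default-pool-b46381f1-zwko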
When the node boots up again, its status should return to Ready, and the pod whose
status was Unknown will be deleted.
TIP Although a pod isn’t tied to a ReplicationController, the pod does refer-
ence it in the metadata.ownerReferences field, which you can use to easily
find which ReplicationController a pod belongs to.
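The step that added the label isn’t shown here; it would be a plain kubectl label command like the following (a sketch using one of the pod names from the earlier listings):

$ kubectl label pod kubia-dmdck type=special
pod "kubia-dmdck" labeled

$ kubectl get pods --show-labels
NAME          READY     STATUS    RESTARTS   AGE   LABELS
kubia-dmdck   1/1       Running   0          10m   app=kubia,type=special
kubia-oini2   1/1       Running   0          20m   app=kubia
kubia-k0xz6   1/1       Running   0          20m   app=kubia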
You’ve added the type=special label to one of the pods. Listing all pods again shows
the same three pods as before, because no change occurred as far as the Replication-
Controller is concerned.
CHANGING THE LABELS OF A MANAGED POD
Now, you’ll change the app=kubia label to something else. This will make the pod no
longer match the ReplicationController’s label selector, leaving it to only match two
pods. The ReplicationController should therefore start a new pod to bring the num-
ber back to three:
$ kubectl label pod kubia-dmdck app=foo --overwrite
pod "kubia-dmdck" labeled
The --overwrite argument is necessary; otherwise kubectl will only print out a warn-
ing and won’t change the label, to prevent you from inadvertently changing an exist-
ing label’s value when your intent is to add a new one.
Listing all the pods again should now show four pods:
$ kubectl get pods -L app
NAME          READY     STATUS              RESTARTS   AGE   APP
kubia-2qneh   0/1       ContainerCreating   0          2s    kubia   <-- newly created pod replacing the one removed from the RC's scope
kubia-oini2   1/1       Running             0          20m   kubia
kubia-k0xz6   1/1       Running             0          20m   kubia
kubia-dmdck   1/1       Running             0          10m   foo     <-- pod no longer managed by the ReplicationController
NOTE You’re using the -L app option to display the app label in a column.
There, you now have four pods altogether: one that isn’t managed by your Replication-
Controller and three that are. Among them is the newly created pod.
Figure 4.5 illustrates what happened when you changed the pod’s labels so they no
longer matched the ReplicationController’s pod selector. You can see your three pods
and your ReplicationController. After you change the pod’s label from app=kubia to
app=foo, the ReplicationController no longer cares about the pod. Because the con-
troller’s replica count is set to 3 and only two pods match the label selector, the
Figure 4.5 Removing a pod from the scope of a ReplicationController by changing its labels
The same technique can be used with the other controllers covered later in this
chapter, which are also used for managing pods. You’ll never change a controller’s
label selector, but you’ll regularly change its pod template. Let’s take a look at that.
Figure 4.6 Changing a ReplicationController’s pod template only affects pods created afterward and has no effect on existing pods.
As an exercise, you can try editing the ReplicationController and adding a label to the
pod template. You can edit the ReplicationController with the following command:
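The command itself isn’t reproduced in the text; it’s the standard kubectl edit invocation:

$ kubectl edit rc kubia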
This will open the ReplicationController’s YAML definition in your default text editor.
Find the pod template section and add an additional label to the metadata. After you
save your changes and exit the editor, kubectl will update the ReplicationController
and print the following message:
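A sketch of that message (the exact wording depends on the kubectl version):

replicationcontroller "kubia" edited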
You can now list pods and their labels again and confirm that they haven’t changed.
But if you delete the pods and wait for their replacements to be created, you’ll see the
new label.
Editing a ReplicationController like this to change the container image in the pod
template, deleting the existing pods, and letting them be replaced with new ones from
the new template could be used for upgrading pods, but you’ll learn a better way of
doing that in chapter 9.
You can tell kubectl edit which text editor to use by setting the KUBE_EDITOR environment variable, for example:

export KUBE_EDITOR="/usr/bin/nano"

If the KUBE_EDITOR environment variable isn’t set, kubectl edit falls back to using
the default editor, usually configured through the EDITOR environment variable.
To scale the ReplicationController up, open its definition again with kubectl edit rc kubia.
When the text editor opens, find the spec.replicas field and change its value to 10,
as shown in the following listing.
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving
# this file will be reopened with the relevant failures.
apiVersion: v1
kind: ReplicationController
metadata:
...
spec:
  replicas: 3        # Change this value from 3 to 10
  selector:
    app: kubia
...
When you save the file and close the editor, the ReplicationController is updated and
it immediately scales the number of pods to 10:
$ kubectl get rc
NAME DESIRED CURRENT READY AGE
kubia 10 10 4 21m
There you go. If the kubectl scale command makes it look as though you’re telling
Kubernetes exactly what to do, it’s now much clearer that you’re making a declarative
change to the desired state of the ReplicationController and not telling Kubernetes to
do something.
SCALING DOWN WITH THE KUBECTL SCALE COMMAND
Now scale back down to 3. You can use the kubectl scale command:
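The command isn’t shown in the text; it would be the following:

$ kubectl scale rc kubia --replicas=3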
All this command does is modify the spec.replicas field of the ReplicationController’s
definition—like when you changed it through kubectl edit.
UNDERSTANDING THE DECLARATIVE APPROACH TO SCALING
Horizontally scaling pods in Kubernetes is a matter of stating your desire: “I want to
have x number of instances running.” You’re not telling Kubernetes what or how to do
it. You’re just specifying the desired state.
This declarative approach makes interacting with a Kubernetes cluster easy. Imag-
ine if you had to manually determine the current number of running instances and
then explicitly tell Kubernetes how many additional instances to run. That’s more
work and is much more error-prone. Changing a simple number is much easier, and
in chapter 15, you’ll learn that even that can be done by Kubernetes itself if you
enable horizontal pod auto-scaling.
Figure 4.7 Deleting a replication controller with --cascade=false leaves pods unmanaged.
When you delete a ReplicationController with kubectl delete, its pods are deleted along
with it by default. But because the pods aren’t an integral part of the controller, you can
delete only the controller and keep the pods running without interruption, for example
while you replace the ReplicationController that manages them.
When deleting a ReplicationController with kubectl delete, you can keep its
pods running by passing the --cascade=false option to the command. Try that now:
$ kubectl delete rc kubia --cascade=false
replicationcontroller "kubia" deleted
You’ve deleted the ReplicationController so the pods are on their own. They are no
longer managed. But you can always create a new ReplicationController with the
proper label selector and make them managed again.
You usually won’t create them directly, but instead have them created automati-
cally when you create the higher-level Deployment resource, which you’ll learn about
in chapter 9. In any case, you should understand ReplicaSets, so let’s see how they dif-
fer from ReplicationControllers.
apiVersion: apps/v1beta2          # ReplicaSets aren't part of the v1 API; they belong to the apps API group, version v1beta2
kind: ReplicaSet
metadata:
  name: kubia
spec:
  replicas: 3
  selector:
    matchLabels:                  # You're using the simpler matchLabels selector here, which is much like a ReplicationController's selector
      app: kubia
  template:                       # The template is the same as in the ReplicationController
    metadata:
      labels:
        app: kubia
    spec:
      containers:
      - name: kubia
        image: luksa/kubia
The first thing to note is that ReplicaSets aren’t part of the v1 API, so you need to
ensure you specify the proper apiVersion when creating the resource. You’re creating a
resource of type ReplicaSet which has much the same contents as the Replication-
Controller you created earlier.
The only difference is in the selector. Instead of listing labels the pods need to
have directly under the selector property, you’re specifying them under selector
.matchLabels. This is the simpler (and less expressive) way of defining label selectors
in a ReplicaSet. Later, you’ll look at the more expressive option, as well.
You’ll see throughout the book that certain Kubernetes resources are in what’s called
the core API group, which doesn’t need to be specified in the apiVersion field (you
just specify the version—for example, you’ve been using apiVersion: v1 when
defining Pod resources). Other resources, which were introduced in later Kubernetes
versions, are categorized into several API groups. Look at the inside of the book’s
covers to see all resources and their respective API groups.
Because you still have three pods matching the app=kubia selector running from ear-
lier, creating this ReplicaSet will not cause any new pods to be created. The ReplicaSet
will take those existing three pods under its wing.
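Post the manifest to the API server and then examine the ReplicaSet with kubectl describe (a sketch; the file name kubia-replicaset.yaml is an assumption):

$ kubectl create -f kubia-replicaset.yaml
replicaset "kubia" created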
$ kubectl describe rs
Name: kubia
Namespace: default
Selector: app=kubia
Labels: app=kubia
Annotations: <none>
Replicas: 3 current / 3 desired
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=kubia
Containers: ...
Volumes: <none>
Events: <none>
As you can see, the ReplicaSet isn’t any different from a ReplicationController. It’s
showing it has three replicas matching the selector. If you list all the pods, you’ll see
they’re still the same three pods you had before. The ReplicaSet didn’t create any new
ones.
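The main improvement of a ReplicaSet over a ReplicationController is its more expressive label selector. Besides matchLabels, the selector can also use matchExpressions; a sketch of such a selector (only the selector portion of the manifest is shown):

selector:
  matchExpressions:
    - key: app          # This selector requires the pod to contain a label with the "app" key
      operator: In
      values:
        - kubia         # The label's value must be "kubia"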
NOTE Only the selector is shown. You’ll find the whole ReplicaSet definition
in the book’s code archive.
You can add additional expressions to the selector. As in the example, each expression
must contain a key, an operator, and possibly (depending on the operator) a list of
values. You’ll see four valid operators:
In—Label’s value must match one of the specified values.
NotIn—Label’s value must not match any of the specified values.
Exists—Pod must include a label with the specified key (the value isn’t import-
ant). When using this operator, you shouldn’t specify the values field.
DoesNotExist—Pod must not include a label with the specified key. The values
property must not be specified.
If you specify multiple expressions, all those expressions must evaluate to true for the
selector to match a pod. If you specify both matchLabels and matchExpressions, all
the labels must match and all the expressions must evaluate to true for the pod to
match the selector.
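That wraps up this quick look at ReplicaSets. To clean up, delete the ReplicaSet (a sketch; deleting it also deletes the pods it manages):

$ kubectl delete rs kubia
replicaset "kubia" deleted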
Deleting the ReplicaSet should delete all the pods. List the pods to confirm that’s
the case.
Figure 4.8 DaemonSets run only a single pod replica on each node, whereas ReplicaSets scatter them around the whole cluster randomly.
Outside of Kubernetes, such processes would usually be started through system init
scripts or the systemd daemon during node boot up. On Kubernetes nodes, you can
still use systemd to run your system processes, but then you can’t take advantage of all
the features Kubernetes provides.
NOTE Later in the book, you’ll learn that nodes can be made unschedulable,
preventing pods from being deployed to them. A DaemonSet will deploy pods
even to such nodes, because the unschedulable attribute is only used by the
Scheduler, whereas pods managed by a DaemonSet bypass the Scheduler
completely. This is usually desirable, because DaemonSets are meant to run
system services, which usually need to run even on unschedulable nodes.
EXPLAINING DAEMONSETS WITH AN EXAMPLE
Let’s imagine having a daemon called ssd-monitor that needs to run on all nodes
that contain a solid-state drive (SSD). You’ll create a DaemonSet that runs this dae-
mon on all nodes that are marked as having an SSD. The cluster administrators have
added the disk=ssd label to all such nodes, so you’ll create the DaemonSet with a
node selector that only selects nodes with that label, as shown in figure 4.9.
Figure 4.9 Using a DaemonSet with a node selector to deploy system pods only on certain nodes
apiVersion: apps/v1beta2          # DaemonSets are in the apps API group, version v1beta2
kind: DaemonSet
metadata:
  name: ssd-monitor
spec:
  selector:
    matchLabels:
      app: ssd-monitor
  template:
    metadata:
      labels:
        app: ssd-monitor
    spec:
      nodeSelector:               # The pod template includes a node selector, which selects nodes with the disk=ssd label
        disk: ssd
      containers:
      - name: main
        image: luksa/ssd-monitor
You’re defining a DaemonSet that will run a pod with a single container based on the
luksa/ssd-monitor container image. An instance of this pod will be created for each
node that has the disk=ssd label.
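After posting the manifest to the API server, inspect the DaemonSet; a sketch of the commands and output (the file name ssd-monitor-daemonset.yaml is an assumption, and the column layout differs between Kubernetes versions):

$ kubectl create -f ssd-monitor-daemonset.yaml
daemonset "ssd-monitor" created

$ kubectl get ds
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE-SELECTOR   AGE
ssd-monitor   0         0         0       0            0           disk=ssd        8s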
Those zeroes look strange. Didn’t the DaemonSet deploy any pods? List the pods:
$ kubectl get po
No resources found.
Where are the pods? Do you know what’s going on? Yes, you forgot to label your nodes
with the disk=ssd label. No problem—you can do that now. The DaemonSet should
detect that the nodes’ labels have changed and deploy the pod to all nodes with a
matching label. Let’s see if that’s true.
ADDING THE REQUIRED LABEL TO YOUR NODE(S)
Regardless of whether you’re using Minikube, GKE, or another multi-node cluster, you’ll need
to list the nodes first, because you’ll need to know the node’s name when labeling it:
$ kubectl get node
NAME STATUS AGE VERSION
minikube Ready 4d v1.6.0
Now, add the disk=ssd label to one of your nodes like this:
$ kubectl label node minikube disk=ssd
node "minikube" labeled
NOTE Replace minikube with the name of one of your nodes if you’re not
using Minikube.
The DaemonSet should have created one pod now. Let’s see:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
ssd-monitor-hgxwq 1/1 Running 0 35s
Okay; so far so good. If you have multiple nodes and you add the same label to further
nodes, you’ll see the DaemonSet spin up pods for each of them.
REMOVING THE REQUIRED LABEL FROM THE NODE
Now, imagine you’ve made a mistake and have mislabeled one of the nodes. It has a
spinning disk drive, not an SSD. What happens if you change the node’s label?
$ kubectl label node minikube disk=hdd --overwrite
node "minikube" labeled
Let’s see if the change has any effect on the pod that was running on that node:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
ssd-monitor-hgxwq 1/1 Terminating 0 4m
The pod is being terminated. But you knew that was going to happen, right? This
wraps up your exploration of DaemonSets, so you may want to delete your ssd-monitor
DaemonSet. If you still have any other daemon pods running, you’ll see that deleting
the DaemonSet deletes those pods as well.
Figure 4.10 Pods managed by Jobs are rescheduled until they finish successfully.
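The Job manifest itself isn’t reproduced above; a minimal sketch of what it would contain, assuming the image named in the text and the restartPolicy discussed next:

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  template:
    metadata:
      labels:
        app: batch-job
    spec:
      restartPolicy: OnFailure    # Job pods must not use the default Always policy
      containers:
      - name: main
        image: luksa/batch-job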
Jobs are part of the batch API group and v1 API version. The YAML defines a
resource of type Job that will run the luksa/batch-job image, which invokes a pro-
cess that runs for exactly 120 seconds and then exits.
In a pod’s specification, you can specify what Kubernetes should do when the
processes running in the container finish. This is done through the restartPolicy
pod spec property, which defaults to Always. Job pods can’t use the default policy,
because they’re not meant to run indefinitely. Therefore, you need to explicitly set
the restart policy to either OnFailure or Never. This setting is what prevents the con-
tainer from being restarted when it finishes (not the fact that the pod is being man-
aged by a Job resource).
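After defining the Job, post it to the API server (assuming the manifest is saved as batch-job.yaml):

$ kubectl create -f batch-job.yaml
job "batch-job" created

The Job’s pod should show up in the pod list almost immediately: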
$ kubectl get po
NAME READY STATUS RESTARTS AGE
batch-job-28qf4 1/1 Running 0 4s
After the two minutes have passed, the pod will no longer show up in the pod list and
the Job will be marked as completed. By default, completed pods aren’t shown when
you list pods, unless you use the --show-all (or -a) switch:
$ kubectl get po -a
NAME READY STATUS RESTARTS AGE
batch-job-28qf4 0/1 Completed 0 2m
The reason the pod isn’t deleted when it completes is to allow you to examine its logs;
for example:
$ kubectl logs batch-job-28qf4
Fri Apr 29 09:58:22 UTC 2016 Batch job starting
Fri Apr 29 10:00:22 UTC 2016 Finished succesfully
The pod will be deleted when you delete it or the Job that created it. Before you do
that, let’s look at the Job resource again:
$ kubectl get job
NAME DESIRED SUCCESSFUL AGE
batch-job 1 1 9m
The Job is shown as having completed successfully. But why is that piece of informa-
tion shown as a number instead of as yes or true? And what does the DESIRED column
indicate?
apiVersion: batch/v1
kind: Job
metadata:
  name: multi-completion-batch-job
spec:
  completions: 5          # Setting completions to 5 makes this Job run five pods sequentially
  template:
    <template is the same as in listing 4.11>
This Job will run five pods one after the other. It initially creates one pod, and when
the pod’s container finishes, it creates the second pod, and so on, until five pods com-
plete successfully. If one of the pods fails, the Job creates a new pod, so the Job may
create more than five pods overall.
RUNNING JOB PODS IN PARALLEL
Instead of running single Job pods one after the other, you can also make the Job run
multiple pods in parallel. You specify how many pods are allowed to run in parallel
with the parallelism Job spec property, as shown in the following listing.
apiVersion: batch/v1
kind: Job
metadata:
  name: multi-completion-batch-job
spec:
  completions: 5          # This Job must ensure five pods complete successfully
  parallelism: 2          # Up to two pods can run in parallel
  template:
    <same as in listing 4.11>
By setting parallelism to 2, the Job creates two pods and runs them in parallel:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
multi-completion-batch-job-lmmnk 1/1 Running 0 21s
multi-completion-batch-job-qx4nq 1/1 Running 0 21s
As soon as one of them finishes, the Job will run the next pod, until five pods finish
successfully.
SCALING A JOB
You can even change a Job’s parallelism property while the Job is running. This is
similar to scaling a ReplicaSet or ReplicationController, and can be done with the
kubectl scale command:
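In the Kubernetes version the book targets, that worked like this (a sketch; in later versions you change the Job’s spec.parallelism field directly instead):

$ kubectl scale job multi-completion-batch-job --replicas 3
job "multi-completion-batch-job" scaled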
NOTE You can configure how many times a Job can be retried before it is
marked as failed by specifying the spec.backoffLimit field in the Job mani-
fest. If you don't explicitly specify it, it defaults to 6.
apiVersion: batch/v1beta1               # API group is batch, version is v1beta1
kind: CronJob
metadata:
  name: batch-job-every-fifteen-minutes
spec:
  schedule: "0,15,30,45 * * * *"        # This job should run at the 0, 15, 30 and 45 minutes of every hour, every day
  jobTemplate:                          # The template for the Job resources that will be created by this CronJob
    spec:
      template:
        metadata:
          labels:
            app: periodic-batch-job
        spec:
          restartPolicy: OnFailure
          containers:
          - name: main
            image: luksa/batch-job
As you can see, it’s not too complicated. You’ve specified a schedule and a template
from which the Job objects will be created.
CONFIGURING THE SCHEDULE
If you’re unfamiliar with the cron schedule format, you’ll find great tutorials and
explanations online, but as a quick introduction, from left to right, the schedule con-
tains the following five entries:
Minute
Hour
Day of month
Month
Day of week.
In the example, you want to run the job every 15 minutes, so the schedule needs to be
"0,15,30,45 * * * *", which means at the 0, 15, 30 and 45 minutes mark of every hour
(first asterisk), of every day of the month (second asterisk), of every month (third
asterisk) and on every day of the week (fourth asterisk).
If, instead, you wanted it to run every 30 minutes, but only on the first day of the
month, you’d set the schedule to "0,30 * 1 * *", and if you want it to run at 3AM every
Sunday, you’d set it to "0 3 * * 0" (the last zero stands for Sunday).
CONFIGURING THE JOB TEMPLATE
A CronJob creates Job resources from the jobTemplate property configured in the
CronJob spec, so refer to section 4.5 for more information on how to configure it.
It may happen that the Job or pod is created and run relatively late. You may have
a hard requirement for the job to not be started too far over the scheduled time. In
that case, you can specify a deadline by specifying the startingDeadlineSeconds field
in the CronJob specification as shown in the following listing.
apiVersion: batch/v1beta1
kind: CronJob
spec:
  schedule: "0,15,30,45 * * * *"
  startingDeadlineSeconds: 15           # At the latest, the pod must start running within 15 seconds of the scheduled time
  ...
In the example in listing 4.15, one of the times the job is supposed to run is 10:30:00.
If it doesn’t start by 10:30:15 for whatever reason, the job will not run and will be
shown as Failed.
In normal circumstances, a CronJob always creates only a single Job for each exe-
cution configured in the schedule, but it may happen that two Jobs are created at the
same time, or none at all. To combat the first problem, your jobs should be idempo-
tent (running them multiple times instead of once shouldn’t lead to unwanted
results). For the second problem, make sure that the next job run performs any work
that should have been done by the previous (missed) run.
4.7 Summary
You’ve now learned how to keep pods running and have them rescheduled in the
event of node failures. You should now know that
You can specify a liveness probe to have Kubernetes restart your container as
soon as it’s no longer healthy (where the app defines what’s considered
healthy).
Pods shouldn’t be created directly, because they will not be re-created if they’re
deleted by mistake, if the node they’re running on fails, or if they’re evicted
from the node.
ReplicationControllers always keep the desired number of pod replicas
running.
Scaling pods horizontally is as easy as changing the desired replica count on a
ReplicationController.
Pods aren’t owned by the ReplicationControllers and can be moved between
them if necessary.
A ReplicationController creates new pods from a pod template. Changing the
template has no effect on existing pods.
You’ve learned about pods and how to deploy them through ReplicaSets and similar
resources to ensure they keep running. Although certain pods can do their work
independently of an external stimulus, many applications these days are meant to
respond to external requests. For example, in the case of microservices, pods will
usually respond to HTTP requests coming either from other pods inside the cluster
or from clients outside the cluster.
Pods need a way of finding other pods if they want to consume the services they
provide. Unlike in the non-Kubernetes world, where a sysadmin would configure
each client app by specifying the exact IP address or hostname of the server providing
the service in the client’s configuration files, doing the same in Kubernetes wouldn’t
work, because
Pods are ephemeral—They may come and go at any time, whether it’s because a
pod is removed from a node to make room for other pods, because someone
scaled down the number of pods, or because a cluster node has failed.
Kubernetes assigns an IP address to a pod after the pod has been scheduled to a node
and before it’s started —Clients thus can’t know the IP address of the server pod
up front.
Horizontal scaling means multiple pods may provide the same service—Each of those
pods has its own IP address. Clients shouldn’t care how many pods are backing
the service and what their IPs are. They shouldn’t have to keep a list of all the
individual IPs of pods. Instead, all those pods should be accessible through a
single IP address.
To solve these problems, Kubernetes also provides another resource type—Services—
that we’ll discuss in this chapter.
By creating a service for the backend pods, you give the frontend a single, stable IP address
and port that doesn’t change even if the pods’ IP addresses change. Additionally, by creating the service, you
also enable the frontend pods to easily find the backend service by its name through
either environment variables or DNS. All the components of your system (the two ser-
vices, the two sets of pods backing those services, and the interdependencies between
them) are shown in figure 5.1.
Figure 5.1 Both internal and external clients usually connect to pods through services.
You now understand the basic idea behind services. Now, let’s dig deeper by first see-
ing how they can be created.
Figure 5.2 Label selectors determine which pods belong to the Service.
apiVersion: v1
kind: Service
metadata:
  name: kubia
spec:
  ports:
  - port: 80              # The port this service will be available on
    targetPort: 8080      # The container port the service will forward to
  selector:
    app: kubia            # All pods with the app=kubia label will be part of this service
You’re defining a service called kubia, which will accept connections on port 80 and
route each connection to port 8080 of one of the pods matching the app=kubia
label selector.
Go ahead and create the service by posting the file using kubectl create.
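A sketch of creating the service and then listing it (assuming the manifest is saved as kubia-svc.yaml; the cluster IP shown matches the one referenced in the next paragraph, but yours will differ):

$ kubectl create -f kubia-svc.yaml
service "kubia" created

$ kubectl get svc
NAME         CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
kubernetes   10.111.240.1     <none>        443/TCP   30m
kubia        10.111.249.153   <none>        80/TCP    6m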
The list shows that the IP address assigned to the service is 10.111.249.153. Because
this is the cluster IP, it’s only accessible from inside the cluster. The primary purpose
of services is exposing groups of pods to other pods in the cluster, but you’ll usually
also want to expose services externally. You’ll see how to do that later. For now, let’s
use your service from inside the cluster and see what it does.
TESTING YOUR SERVICE FROM WITHIN THE CLUSTER
You can send requests to your service from within the cluster in a few ways:
The obvious way is to create a pod that will send the request to the service’s
cluster IP and log the response. You can then examine the pod’s log to see
what the service’s response was.
You can ssh into one of the Kubernetes nodes and use the curl command.
You can execute the curl command inside one of your existing pods through
the kubectl exec command.
Let’s go for the last option, so you also learn how to run commands in existing pods.
REMOTELY EXECUTING COMMANDS IN RUNNING CONTAINERS
The kubectl exec command allows you to remotely run arbitrary commands inside
an existing container of a pod. This comes in handy when you want to examine the
contents, state, and/or environment of a container. List the pods with the kubectl
get pods command and choose one as your target for the exec command (in the fol-
lowing example, I’ve chosen the kubia-7nog1 pod as the target). You’ll also need to
obtain the cluster IP of your service (using kubectl get svc, for example). When run-
ning the following commands yourself, be sure to replace the pod name and the ser-
vice IP with your own:
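A sketch of such an invocation (the responding pod’s name is illustrative):

$ kubectl exec kubia-7nog1 -- curl -s http://10.111.249.153
You've hit kubia-gzwli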
If you’ve used ssh to execute commands on a remote system before, you’ll recognize
that kubectl exec isn’t much different.
You may wonder about the double dash (--) in the command. It signals the end of command
options for kubectl; everything after it is the command to be executed inside the pod. If you
omitted the double dash, the -s flag would be interpreted as an option for kubectl itself, and
you’d get a strange error saying the connection to the server 10.111.249.153 was refused.
This has nothing to do with your service refusing the connection. It’s because
kubectl is not able to connect to an API server at 10.111.249.153 (the -s option
is used to tell kubectl to connect to a different API server than the default).
Let’s go over what transpired when you ran the command. Figure 5.3 shows the
sequence of events. You instructed Kubernetes to execute the curl command inside the
container of one of your pods. Curl sent an HTTP request to the service IP, which is
backed by three pods. The Kubernetes service proxy intercepted the connection,
selected a random pod among the three pods, and forwarded the request to it. Node.js
running inside that pod then handled the request and returned an HTTP response con-
taining the pod’s name. Curl then printed the response to the standard output, which
was intercepted and printed to its standard output on your local machine by kubectl.
Figure 5.3 Using kubectl exec to test out a connection to the service by running curl in one of the pods
In the previous example, you executed the curl command as a separate process, but
inside the pod’s main container. This isn’t much different from the actual main pro-
cess in the container talking to the service.
CONFIGURING SESSION AFFINITY ON THE SERVICE
If you execute the same command a few more times, you should hit a different pod
with every invocation, because the service proxy normally forwards each connection
to a randomly selected backing pod, even if the connections are coming from the
same client.
If, on the other hand, you want all requests made by a certain client to be redi-
rected to the same pod every time, you can set the service’s sessionAffinity property
to ClientIP (instead of None, which is the default), as shown in the following listing.
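The listing isn’t reproduced here; the relevant part of such a service manifest would look like this:

apiVersion: v1
kind: Service
spec:
  sessionAffinity: ClientIP    # Forward all requests from the same client IP to the same pod
  ...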
This makes the service proxy redirect all requests originating from the same client IP
to the same pod. As an exercise, you can create an additional service with session affin-
ity set to ClientIP and try sending requests to it.
Kubernetes supports only two types of service session affinity: None and ClientIP.
You may be surprised it doesn’t have a cookie-based session affinity option, but you
need to understand that Kubernetes services don’t operate at the HTTP level. Services
deal with TCP and UDP packets and don’t care about the payload they carry. Because
cookies are a construct of the HTTP protocol, services don’t know about them, which
explains why session affinity cannot be based on cookies.
EXPOSING MULTIPLE PORTS IN THE SAME SERVICE
Your service exposes only a single port, but services can also support multiple ports. For
example, if your pods listened on two ports—let’s say 8080 for HTTP and 8443 for
HTTPS—you could use a single service to forward both port 80 and 443 to the pod’s
ports 8080 and 8443. You don’t need to create two different services in such cases. Using
a single, multi-port service exposes all the service’s ports through a single cluster IP.
NOTE When creating a service with multiple ports, you must specify a name
for each port.
The spec for a multi-port service is shown in the following listing.
spec:
  ports:
  - name: http
    port: 80              # Port 80 is mapped to the pods' port 8080
    targetPort: 8080
  - name: https
    port: 443             # Port 443 is mapped to the pods' port 8443
    targetPort: 8443
  selector:
    app: kubia            # The label selector always applies to the whole service
NOTE The label selector applies to the service as a whole—it can’t be config-
ured for each port individually. If you want different ports to map to different
subsets of pods, you need to create two services.
Because your kubia pods don’t listen on multiple ports, creating a multi-port service
and a multi-port pod is left as an exercise to you.
USING NAMED PORTS
In all these examples, you’ve referred to the target port by its number, but you can also
give a name to each pod’s port and refer to it by name in the service spec. This makes
the service spec slightly clearer, especially if the port numbers aren’t well-known.
For example, suppose your pod defines names for its ports as shown in the follow-
ing listing.
kind: Pod
spec:
  containers:
  - name: kubia
    ports:
    - name: http               # Container's port 8080 is called http
      containerPort: 8080
    - name: https              # Port 8443 is called https
      containerPort: 8443
You can then refer to those ports by name in the service spec, as shown in the follow-
ing listing.
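A sketch of a service spec referring to those named ports (targetPort refers to the port name defined in the pod):

apiVersion: v1
kind: Service
spec:
  ports:
  - name: http
    port: 80
    targetPort: http     # Port 80 maps to the container port named "http"
  - name: https
    port: 443
    targetPort: https    # Port 443 maps to the container port named "https"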
But why should you even bother with naming ports? The biggest benefit of doing so is
that it enables you to change port numbers later without having to change the service
spec. Your pod currently uses port 8080 for http, but what if you later decide you’d
like to move that to port 80?
If you’re using named ports, all you need to do is change the port number in the
pod spec (while keeping the port’s name unchanged). As you spin up pods with the
new ports, client connections will be forwarded to the appropriate port numbers,
depending on the pod receiving the connection (port 8080 on old pods and port 80
on the new ones).
Because service-related environment variables are set only when a pod’s containers are started,
pods created before the service won’t have them; you’d need to delete those pods and let the
ReplicationController create replacements. Now you can list the new pods (I’m sure you know
how to do that) and pick one as your target for the kubectl exec command. Once you’ve selected
your target pod, you can list environment variables by running the env command inside the container,
as shown in the following listing.
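A trimmed sketch of the kind of output you’d see (the values are examples; the service IPs will be your own cluster’s):

$ kubectl exec kubia-3inly env
PATH=/usr/local/sbin:...
HOSTNAME=kubia-3inly
KUBERNETES_SERVICE_HOST=10.111.240.1
KUBERNETES_SERVICE_PORT=443
...
KUBIA_SERVICE_HOST=10.111.249.153
KUBIA_SERVICE_PORT=80
...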
Two services are defined in your cluster: the kubernetes and the kubia service (you
saw this earlier with the kubectl get svc command); consequently, two sets of service-
related environment variables are in the list. Among the variables that pertain to the
kubia service you created at the beginning of the chapter, you’ll see the KUBIA_SERVICE
_HOST and the KUBIA_SERVICE_PORT environment variables, which hold the IP address
and port of the kubia service, respectively.
Turning back to the frontend-backend example we started this chapter with, when
you have a frontend pod that requires the use of a backend database server pod, you
can expose the backend pod through a service called backend-database and then
have the frontend pod look up its IP address and port through the environment vari-
ables BACKEND_DATABASE_SERVICE_HOST and BACKEND_DATABASE_SERVICE_PORT.
NOTE Dashes in the service name are converted to underscores and all let-
ters are uppercased when the service name is used as the prefix in the envi-
ronment variable’s name.
Environment variables are one way of looking up the IP and port of a service, but isn’t
this usually the domain of DNS? Why doesn’t Kubernetes include a DNS server and
allow you to look up service IPs through DNS instead? As it turns out, it does!
DISCOVERING SERVICES THROUGH DNS
Remember in chapter 3 when you listed pods in the kube-system namespace? One of
the pods was called kube-dns. The kube-system namespace also includes a corre-
sponding service with the same name.
As the name suggests, the pod runs a DNS server, which all other pods running in
the cluster are automatically configured to use (Kubernetes does that by modifying
each container’s /etc/resolv.conf file). Any DNS query performed by a process run-
ning in a pod will be handled by Kubernetes’ own DNS server, which knows all the ser-
vices running in your system.
NOTE Whether a pod uses the internal DNS server or not is configurable
through the dnsPolicy property in each pod’s spec.
Each service gets a DNS entry in the internal DNS server, and client pods that know
the name of the service can access it through its fully qualified domain name (FQDN)
instead of resorting to environment variables.
For example, the frontend pod from the earlier example could access the backend database service through the following FQDN:

backend-database.default.svc.cluster.local

Here, backend-database corresponds to the service name, default stands for the namespace
the service is defined in, and svc.cluster.local is a configurable cluster
domain suffix used in all cluster local service names.
NOTE The client must still know the service’s port number. If the service is
using a standard port (for example, 80 for HTTP or 5432 for Postgres), that
shouldn’t be a problem. If not, the client can get the port number from the
environment variable.
Connecting to a service can be even simpler than that. You can omit the svc.cluster
.local suffix and even the namespace, when the frontend pod is in the same name-
space as the database pod. You can thus refer to the service simply as backend-
database. That’s incredibly simple, right?
Let’s try this. You’ll try to access the kubia service through its FQDN instead of its
IP. Again, you’ll need to do that inside an existing pod. You already know how to use
kubectl exec to run a single command in a pod’s container, but this time, instead of
running the curl command directly, you’ll run the bash shell instead, so you can then
run multiple commands in the container. This is similar to what you did in chapter 2
when you entered the container you ran with Docker by using the docker exec -it
bash command.
RUNNING A SHELL IN A POD’S CONTAINER
You can use the kubectl exec command to run bash (or any other shell) inside a
pod’s container. This way you’re free to explore the container as long as you want,
without having to perform a kubectl exec for every command you want to run.
NOTE The shell’s binary executable must be available in the container image
for this to work.
To use the shell properly, you need to pass the -it option to kubectl exec:
$ kubectl exec -it kubia-3inly bash
root@kubia-3inly:/#
You’re now inside the container. You can use the curl command to access the kubia
service in any of the following ways:
root@kubia-3inly:/# curl http://kubia.default.svc.cluster.local
You’ve hit kubia-5asi2
You can hit your service by using the service’s name as the hostname in the requested
URL. You can omit the namespace and the svc.cluster.local suffix because of how
the DNS resolver inside each pod’s container is configured. Look at the /etc/resolv.conf
file in the container and you’ll understand:
root@kubia-3inly:/# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local ...
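If you try pinging the service from inside the pod, though, you’ll see it doesn’t respond (a hypothetical session; the packet counts will differ):

root@kubia-3inly:/# ping kubia
PING kubia (10.111.249.153): 56 data bytes
^C--- kubia ping statistics ---
54 packets transmitted, 0 packets received, 100% packet loss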
Hmm. curl-ing the service works, but pinging it doesn’t. That’s because the service’s
cluster IP is a virtual IP, and only has meaning when combined with the service port.
We’ll explain what that means and how services work in chapter 11. I wanted to men-
tion that here because it’s the first thing users do when they try to debug a broken
service and it catches most of them off guard.
An Endpoints resource (yes, plural) is a list of IP addresses and ports exposing a ser-
vice. The Endpoints resource is like any other Kubernetes resource, so you can display
its basic info with kubectl get:
$ kubectl get endpoints kubia
NAME ENDPOINTS AGE
kubia 10.108.1.4:8080,10.108.2.5:8080,10.108.2.6:8080 1h
Although the pod selector is defined in the service spec, it’s not used directly when
redirecting incoming connections. Instead, the selector is used to build a list of IPs
and ports, which is then stored in the Endpoints resource. When a client connects to a
service, the service proxy selects one of those IP and port pairs and redirects the
incoming connection to the server listening at that location.
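Services can also be used to expose endpoints you manage yourself, such as servers running outside the cluster. In that case, you create the service without a pod selector; a minimal sketch of such a manifest:

apiVersion: v1
kind: Service
metadata:
  name: external-service     # No pod selector is defined for this service
spec:
  ports:
  - port: 80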
You’re defining a service called external-service that will accept incoming connec-
tions on port 80. You didn’t define a pod selector for the service.
CREATING AN ENDPOINTS RESOURCE FOR A SERVICE WITHOUT A SELECTOR
Endpoints are a separate resource and not an attribute of a service. Because you cre-
ated the service without a selector, the corresponding Endpoints resource hasn’t been
created automatically, so it’s up to you to create it. The following listing shows its
YAML manifest.
apiVersion: v1
kind: Endpoints
metadata:
  name: external-service        # The name of the Endpoints object must match the name of the service (see previous listing)
subsets:
  - addresses:
    - ip: 11.11.11.11           # The IPs of the endpoints that the service will forward connections to
    - ip: 22.22.22.22
    ports:
    - port: 80                  # The target port of the endpoints
The Endpoints object needs to have the same name as the service and contain the list
of target IP addresses and ports for the service. After both the Service and the End-
points resource are posted to the server, the service is ready to be used like any regular
service with a pod selector. Containers created after the service is created will include
the environment variables for the service, and all connections to its IP:port pair will be
load balanced between the service’s endpoints.
Figure 5.4 shows three pods connecting to the service with external endpoints.
Figure 5.4 Pods in the Kubernetes cluster connecting through the service (10.111.249.214:80) to two external servers on the internet (11.11.11.11:80 and 22.22.22.22:80)
If you later decide to migrate the external service to pods running inside Kubernetes,
you can add a selector to the service, thereby making its Endpoints managed automat-
ically. The same is also true in reverse—by removing the selector from a Service,
Kubernetes stops updating its Endpoints. This means a service IP address can remain
constant while the actual implementation of the service is changed.
apiVersion: v1
kind: Service
metadata:
  name: external-service
spec:
  type: ExternalName                          # Service type is set to ExternalName
  externalName: someapi.somecompany.com       # The fully qualified domain name of the actual service
  ports:
  - port: 80
After the service is created, pods can connect to the external service through the
external-service.default.svc.cluster.local domain name (or even external-
service) instead of using the service’s actual FQDN. This hides the actual service
name and its location from pods consuming the service, allowing you to modify the
service definition and point it to a different service any time later, by only changing
the externalName attribute or by changing the type back to ClusterIP and creating
an Endpoints object for the service—either manually or by specifying a label selector
on the service and having it created automatically.
ExternalName services are implemented solely at the DNS level—a simple CNAME
DNS record is created for the service. Therefore, clients connecting to the service will
connect to the external service directly, bypassing the service proxy completely. For
this reason, these types of services don’t even get a cluster IP.
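The first way to expose a service externally is a NodePort service, which opens a port on every cluster node and forwards connections received on that port to the backing pods. A sketch of such a manifest (the name kubia-nodeport and node port 30123 are the ones referenced in the rest of this section):

apiVersion: v1
kind: Service
metadata:
  name: kubia-nodeport
spec:
  type: NodePort
  ports:
  - port: 80             # The port of the service's internal cluster IP
    targetPort: 8080     # The target port of the backing pods
    nodePort: 30123      # The service will be accessible through port 30123 of each cluster node
  selector:
    app: kubia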
You set the type to NodePort and specify the node port this service should be bound to
across all cluster nodes. Specifying the port isn’t mandatory; Kubernetes will choose a
random port if you omit it.
NOTE When you create the service in GKE, kubectl prints out a warning
about having to configure firewall rules. We’ll see how to do that soon.
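After creating the service, inspect it (a sketch of the output; the cluster IP shown matches the one mentioned below, but yours will differ):

$ kubectl get svc kubia-nodeport
NAME             CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubia-nodeport   10.11.254.223   <nodes>       80:30123/TCP   2m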
Look at the EXTERNAL-IP column. It shows <nodes>, indicating the service is accessible
through the IP address of any cluster node. The PORT(S) column shows both the
internal port of the cluster IP (80) and the node port (30123). The service is accessi-
ble at the following addresses:
10.11.254.223:80
<1st node’s IP>:30123
<2nd node’s IP>:30123, and so on.
Figure 5.6 shows your service exposed on port 30123 of both of your cluster nodes
(this applies if you’re running this on GKE; Minikube only has a single node, but the
principle is the same). An incoming connection to one of those ports will be redi-
rected to a randomly selected pod, which may or may not be the one running on the
node the connection is being made to.
Figure 5.6 An external client connecting to a NodePort service either through Node 1 or 2
A connection received on port 30123 of the first node might be forwarded either to
the pod running on the first node or to one of the pods running on the second node.
CHANGING FIREWALL RULES TO LET EXTERNAL CLIENTS ACCESS OUR NODEPORT SERVICE
As I’ve mentioned previously, before you can access your service through the node
port, you need to configure the Google Cloud Platform’s firewalls to allow external
connections to your nodes on that port. You’ll do this now:
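On GKE that means creating a firewall rule; a sketch of the gcloud command (the rule name is arbitrary):

$ gcloud compute firewall-rules create kubia-svc-rule --allow=tcp:30123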
You can access your service through port 30123 of one of the node’s IPs. But you need
to figure out the IP of a node first. Refer to the sidebar on how to do that.
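One way to grab the nodes’ external IPs in a single command is with a JSONPath output expression, which is what the sidebar below explains (a sketch; the printed IPs are the ones used in the following examples):

$ kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="ExternalIP")].address}'
130.211.97.55 130.211.99.206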
You’re telling kubectl to only output the information you want by specifying a
JSONPath. You’re probably familiar with XPath and how it’s used with XML. JSONPath
is basically XPath for JSON. The JSONPath in the previous example instructs kubectl
to do the following:
Go through all the elements in the items attribute.
For each element, enter the status attribute.
Filter elements of the addresses attribute, taking only those that have the
type attribute set to ExternalIP.
Finally, print the address attribute of the filtered elements.
To learn more about how to use JSONPath with kubectl, refer to the documentation
at http://kubernetes.io/docs/user-guide/jsonpath.
Once you know the IPs of your nodes, you can try accessing your service through them:
$ curl http://130.211.97.55:30123
You've hit kubia-ym8or
$ curl http://130.211.99.206:30123
You've hit kubia-xueq1
TIP When using Minikube, you can easily access your NodePort services
through your browser by running minikube service <service-name> [-n
<namespace>].
As you can see, your pods are now accessible to the whole internet through port 30123
on any of your nodes. It doesn’t matter what node a client sends the request to. But if
you only point your clients to the first node, when that node fails, your clients can’t
access the service anymore. That’s why it makes sense to put a load balancer in front
of the nodes to make sure you’re spreading requests across all healthy nodes and
never sending them to a node that’s offline at that moment.
If your Kubernetes cluster supports it (which is mostly true when Kubernetes is
deployed on cloud infrastructure), the load balancer can be provisioned automati-
cally by creating a LoadBalancer instead of a NodePort service. We’ll look at this next.
service’s type to LoadBalancer instead of NodePort. The load balancer will have its
own unique, publicly accessible IP address and will redirect all connections to your
service. You can thus access your service through the load balancer’s IP address.
If Kubernetes is running in an environment that doesn’t support LoadBalancer
services, the load balancer will not be provisioned, but the service will still behave like
a NodePort service. That’s because a LoadBalancer service is an extension of a Node-
Port service. You’ll run this example on Google Kubernetes Engine, which supports
LoadBalancer services. Minikube doesn’t, at least not as of this writing.
apiVersion: v1
kind: Service
metadata:
name: kubia-loadbalancer This type of service obtains
spec: a load balancer from the
type: LoadBalancer infrastructure hosting the
ports: Kubernetes cluster.
- port: 80
targetPort: 8080
selector:
app: kubia
The service type is set to LoadBalancer instead of NodePort. You’re not specifying a spe-
cific node port, although you could (you’re letting Kubernetes choose one instead).
CONNECTING TO THE SERVICE THROUGH THE LOAD BALANCER
After you create the service, it takes time for the cloud infrastructure to create the
load balancer and write its IP address into the Service object. Once it does that, the IP
address will be listed as the external IP address of your service:
$ kubectl get svc kubia-loadbalancer
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubia-loadbalancer 10.111.241.153 130.211.53.173 80:32143/TCP 1m
In this case, the load balancer is available at IP 130.211.53.173, so you can now access
the service at that IP address:
$ curl http://130.211.53.173
You've hit kubia-xueq1
Success! As you may have noticed, this time you didn’t need to mess with firewalls the
way you had to before with the NodePort service.
See figure 5.7 for how HTTP requests are delivered to the pod. External clients
(curl in your case) connect to port 80 of the load balancer and get routed to the
implicitly assigned node port on one of the nodes. From there, the connection is
forwarded to one of the pod instances.

Figure 5.7 An external client connecting to a LoadBalancer service: the client connects to the load balancer (130.211.53.173:80), which routes the connection through a node port on Node 1 (130.211.97.55) or Node 2 (130.211.99.206) to one of the pods.
warded to one of the pod instances.
As already mentioned, a LoadBalancer-type service is a NodePort service with an
additional infrastructure-provided load balancer. If you use kubectl describe to dis-
play additional info about the service, you’ll see that a node port has been selected for
the service. If you were to open the firewall for this port, the way you did in the previ-
ous section about NodePort services, you could access the service through the node
IPs as well.
TIP If you’re using Minikube, even though the load balancer will never be
provisioned, you can still access the service through the node port (at the
Minikube VM’s IP address).
You can prevent the additional network hop that occurs when an external connection is
forwarded to a pod running on a different node by setting the externalTrafficPolicy field
in the service spec:

spec:
  externalTrafficPolicy: Local
  ...

The downside is that the load balancer still splits connections evenly across the nodes
(50%/50% in a two-node cluster), so if the pods aren’t spread evenly across the nodes,
the load across the pods will be uneven.
Let me first explain why you need another way to access Kubernetes services from the
outside.
UNDERSTANDING WHY INGRESSES ARE NEEDED
One important reason is that each LoadBalancer service requires its own load bal-
ancer with its own public IP address, whereas an Ingress only requires one, even when
providing access to dozens of services. When a client sends an HTTP request to the
Ingress, the host and path in the request determine which service the request is for-
warded to, as shown in figure 5.9.
Figure 5.9 A single Ingress routing client requests to different services based on the requested host and path: kubia.example.com/kubia, kubia.example.com/foo, foo.example.com, and bar.example.com each map to a different service and its pods.
Ingresses operate at the application layer of the network stack (HTTP) and can pro-
vide features such as cookie-based session affinity and the like, which services can’t.
UNDERSTANDING THAT AN INGRESS CONTROLLER IS REQUIRED
Before we go into the features an Ingress object provides, let me emphasize that to
make Ingress resources work, an Ingress controller needs to be running in the cluster.
Different Kubernetes environments use different implementations of the controller,
but several don’t provide a default controller at all.
For example, Google Kubernetes Engine uses Google Cloud Platform’s own HTTP
load-balancing features to provide the Ingress functionality. Initially, Minikube didn’t
provide a controller out of the box, but it now includes an add-on that can be enabled
to let you try out the Ingress functionality. Follow the instructions in the following
sidebar to ensure it’s enabled.
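You can check which add-ons are available and enabled with minikube addons list (a sketch; the exact set of add-ons depends on your Minikube version):

$ minikube addons list
- default-storageclass: enabled
- dashboard: enabled
- kube-dns: enabled
- ingress: disabled
...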
You’ll learn about what these add-ons are throughout the book, but it should be
pretty clear what the dashboard and the kube-dns add-ons do. Enable the Ingress
add-on so you can see Ingresses in action:
$ minikube addons enable ingress
ingress was successfully enabled
This should have spun up an Ingress controller as another pod. Most likely, the
controller pod will be in the kube-system namespace, but not necessarily, so list all
the running pods across all namespaces by using the --all-namespaces option:
$ kubectl get po --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
default kubia-rsv5m 1/1 Running 0 13h
default kubia-fe4ad 1/1 Running 0 13h
default kubia-ke823 1/1 Running 0 13h
kube-system default-http-backend-5wb0h 1/1 Running 0 18m
kube-system kube-addon-manager-minikube 1/1 Running 3 6d
kube-system kube-dns-v20-101vq 3/3 Running 9 6d
kube-system kubernetes-dashboard-jxd9l 1/1 Running 3 6d
kube-system nginx-ingress-controller-gdts0 1/1 Running 0 18m
At the bottom of the output, you see the Ingress controller pod. The name suggests
that Nginx (an open-source HTTP server and reverse proxy) is used to provide the
Ingress functionality.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: kubia
spec:
  rules:
  - host: kubia.example.com             # This Ingress maps the kubia.example.com domain name to your service
    http:
      paths:
      - path: /
        backend:
          serviceName: kubia-nodeport   # All requests will be sent to port 80 of the kubia-nodeport service
          servicePort: 80
This defines an Ingress with a single rule, which makes sure all HTTP requests received
by the Ingress controller, in which the host kubia.example.com is requested, will be
sent to the kubia-nodeport service on port 80.
NOTE Ingress controllers on cloud providers (in GKE, for example) require
the Ingress to point to a NodePort service. But that’s not a requirement of
Kubernetes itself.
NOTE When running on cloud providers, the address may take time to appear,
because the Ingress controller provisions a load balancer behind the scenes.
Once you know the IP the Ingress is available at, you can either configure your DNS servers to
resolve kubia.example.com to that IP or add a line like the following to /etc/hosts (replace the
IP with your own):

192.168.99.100 kubia.example.com
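You can now reach the service through the Ingress (a sketch; the pod name in the response will differ):

$ curl http://kubia.example.com
You've hit kubia-ke823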
You’ve successfully accessed the service through an Ingress. Let’s take a better look at
how that unfolded.
UNDERSTANDING HOW INGRESSES WORK
Figure 5.10 shows how the client connected to one of the pods through the Ingress
controller. The client first performed a DNS lookup of kubia.example.com, and the
DNS server (or the local operating system) returned the IP of the Ingress controller.
The client then sent an HTTP request to the Ingress controller and specified
kubia.example.com in the Host header. From that header, the controller determined
which service the client is trying to access, looked up the pod IPs through the End-
points object associated with the service, and forwarded the client’s request to one of
the pods.
As you can see, the Ingress controller didn’t forward the request to the service. It
only used it to select a pod. Most, if not all, controllers work like this.
Listing 5.14 Ingress exposing multiple services on the same host, but different paths
...
  - host: kubia.example.com
    http:
      paths:
      - path: /kubia
        backend:
          serviceName: kubia     # Requests to kubia.example.com/kubia will be routed to the kubia service.
          servicePort: 80
      - path: /foo
        backend:
          serviceName: bar       # Requests to kubia.example.com/foo will be routed to the bar service.
          servicePort: 80
In this case, requests will be sent to two different services, depending on the path in
the requested URL. Clients can therefore reach two different services through a single
IP address (that of the Ingress controller).
spec:
  rules:
  - host: foo.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: foo       # Requests for foo.example.com will be routed to service foo.
          servicePort: 80
  - host: bar.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: bar       # Requests for bar.example.com will be routed to service bar.
          servicePort: 80
Requests received by the controller will be forwarded to either service foo or bar,
depending on the Host header in the request (the way virtual hosts are handled in
web servers). DNS needs to point both the foo.example.com and the bar.example.com domain names to the Ingress controller’s IP address.
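The two files referenced in the next command are a private key and a certificate for kubia.example.com. One way to create a self-signed pair, assuming openssl is available, is the following:
$ openssl genrsa -out tls.key 2048
$ openssl req -new -x509 -key tls.key -out tls.cert -days 360 -subj /CN=kubia.example.com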
Then you create the Secret from the two files like this:
$ kubectl create secret tls tls-secret --cert=tls.cert --key=tls.key
secret "tls-secret" created
Instead of signing the certificate yourself, you can also have it signed by creating a CertificateSigningRequest (CSR) through the Kubernetes API; the signed certificate can then be retrieved from the CSR’s status.certificate field.
Note that a certificate signer component must be running in the cluster; otherwise creating CertificateSigningRequest resources and approving or denying them won’t have any effect.
The private key and the certificate are now stored in the Secret called tls-secret.
Now, you can update your Ingress object so it will also accept HTTPS requests for
kubia.example.com. The Ingress manifest should now look like the following listing.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: kubia
spec:
  tls:                           # The whole TLS configuration is under this attribute.
  - hosts:
    - kubia.example.com          # TLS connections will be accepted for the kubia.example.com hostname.
    secretName: tls-secret       # The private key and the certificate should be obtained from the tls-secret you created previously.
  rules:
  - host: kubia.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: kubia-nodeport
          servicePort: 80
TIP Instead of deleting the Ingress and re-creating it from the new file, you
can invoke kubectl apply -f kubia-ingress-tls.yaml, which updates the
Ingress resource with what’s specified in the file.
You can now use HTTPS to access your service through the Ingress:
$ curl -k -v https://kubia.example.com/kubia
* About to connect() to kubia.example.com port 443 (#0)
...
* Server certificate:
* subject: CN=kubia.example.com
...
> GET /kubia HTTP/1.1
> ...
You've hit kubia-xueq1
The command’s output shows the response from the app, as well as the server certifi-
cate you configured the Ingress with.
NOTE Support for Ingress features varies between the different Ingress con-
troller implementations, so check the implementation-specific documenta-
tion to see what’s supported.
Ingresses are a relatively new Kubernetes feature, so you can expect to see many
improvements and new features in the future. Although they currently support only
L7 (HTTP/HTTPS) load balancing, support for L4 load balancing is also planned.
A readiness probe can be as simple as an HTTP GET / request, or it can hit a specific URL path, which causes the app to perform a whole list of checks to determine if it’s ready. Such a detailed readiness probe, which takes the app’s specifics into account, is the app developer’s responsibility.
TYPES OF READINESS PROBES
Like liveness probes, three types of readiness probes exist:
An Exec probe, where a process is executed. The container’s status is deter-
mined by the process’ exit status code.
An HTTP GET probe, which sends an HTTP GET request to the container and
the HTTP status code of the response determines whether the container is
ready or not.
A TCP Socket probe, which opens a TCP connection to a specified port of the
container. If the connection is established, the container is considered ready.
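In a container spec, each of these is configured under the readinessProbe field. The following fragments are illustrative sketches (the command, path, and ports are placeholders), showing one alternative at a time:
readinessProbe:            # exec probe: exit code 0 means ready
  exec:
    command: ["cat", "/tmp/ready"]

readinessProbe:            # HTTP GET probe: a 2xx or 3xx response means ready
  httpGet:
    path: /ready
    port: 8080

readinessProbe:            # TCP socket probe: ready if the connection can be established
  tcpSocket:
    port: 3306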
UNDERSTANDING THE OPERATION OF READINESS PROBES
When a container is started, Kubernetes can be configured to wait for a configurable
amount of time to pass before performing the first readiness check. After that, it
invokes the probe periodically and acts based on the result of the readiness probe. If a
pod reports that it’s not ready, it’s removed from the service. If the pod then becomes
ready again, it’s re-added.
Unlike liveness probes, if a container fails the readiness check, it won’t be killed or
restarted. This is an important distinction between liveness and readiness probes.
Liveness probes keep pods healthy by killing off unhealthy containers and replacing
them with new, healthy ones, whereas readiness probes make sure that only pods that
are ready to serve requests receive them. This is mostly necessary during container
start up, but it’s also useful after the container has been running for a while.
As you can see in figure 5.11, if a pod’s readiness probe fails, the pod is removed
from the Endpoints object. Clients connecting to the service will not be redirected to
the pod. The effect is the same as when the pod doesn’t match the service’s label
selector at all.
Figure 5.11 A pod whose readiness probe fails is removed as an endpoint of a service.
Open the ReplicationController’s definition for editing (for example, with kubectl edit rc kubia). When the YAML opens in the text editor, find the container specification in the pod template and add the following readiness probe definition to the first container under spec.template.spec.containers. The YAML should look like the following listing.
apiVersion: v1
kind: ReplicationController
...
spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: kubia
        image: luksa/kubia
        readinessProbe:        # A readinessProbe may be defined for each container in the pod.
          exec:
            command:
            - ls
            - /var/ready
      ...
The readiness probe will periodically perform the command ls /var/ready inside the
container. The ls command returns exit code zero if the file exists, or a non-zero exit
code otherwise. If the file exists, the readiness probe will succeed; otherwise, it will fail.
The reason you’re defining such a strange readiness probe is so you can toggle its
result by creating or removing the file in question. The file doesn’t exist yet, so all the
pods should now report not being ready, right? Well, not exactly. As you may remem-
ber from the previous chapter, changing a ReplicationController’s pod template has
no effect on existing pods.
In other words, all your existing pods still have no readiness probe defined. You
can see this by listing the pods with kubectl get pods and looking at the READY col-
umn. You need to delete the pods and have them re-created by the Replication-
Controller. The new pods will fail the readiness check and won’t be included as
endpoints of the service until you create the /var/ready file in each of them.
OBSERVING AND MODIFYING THE PODS’ READINESS STATUS
List the pods again and inspect whether they’re ready or not:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
kubia-2r1qb 0/1 Running 0 1m
kubia-3rax1 0/1 Running 0 1m
kubia-3yw4s 0/1 Running 0 1m
The READY column shows that none of the containers are ready. Now make the readi-
ness probe of one of them start returning success by creating the /var/ready file,
whose existence makes your mock readiness probe succeed:
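The command that does this looks like the following (use the name of one of your own pods):
$ kubectl exec kubia-2r1qb -- touch /var/ready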
You’ve used the kubectl exec command to execute the touch command inside the
container of the kubia-2r1qb pod. The touch command creates the file if it doesn’t
yet exist. The pod’s readiness probe command should now exit with status code 0,
which means the probe is successful, and the pod should now be shown as ready. Let’s
see if it is:
$ kubectl get po kubia-2r1qb
NAME READY STATUS RESTARTS AGE
kubia-2r1qb 0/1 Running 0 2m
The pod still isn’t ready. Is there something wrong or is this the expected result? Take
a more detailed look at the pod with kubectl describe. The output should contain
the following line:
Readiness: exec [ls /var/ready] delay=0s timeout=1s period=10s #success=1
➥ #failure=3
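These values correspond to fields on the probe definition. Because the period is 10 seconds, the probe only runs every 10 seconds, so it can take up to that long after you create the file before the pod is reported as ready. A sketch of how the defaults shown above map onto the probe spec:
readinessProbe:
  exec:
    command: ["ls", "/var/ready"]
  initialDelaySeconds: 0    # delay
  timeoutSeconds: 1         # timeout
  periodSeconds: 10         # period (the probe runs every 10 seconds)
  successThreshold: 1       # #success
  failureThreshold: 3       # #failure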
Even though there are three pods running, only a single pod is reporting as being
ready and is therefore the only pod receiving requests. If you now delete the file, the
pod will be removed from the service again.
TIP If you want to add or remove a pod from a service manually, add
enabled=true as a label to your pod and to the label selector of your service.
Remove the label when you want to remove the pod from the service.
ALWAYS DEFINE A READINESS PROBE
Before we conclude this section, there are two final notes about readiness probes that
I need to emphasize. First, if you don’t add a readiness probe to your pods, they’ll
become service endpoints almost immediately. If your application takes too long to
start listening for incoming connections, client requests hitting the service will be for-
warded to the pod while it’s still starting up and not ready to accept incoming connec-
tions. Clients will therefore see “Connection refused” types of errors.
TIP You should always define a readiness probe, even if it’s as simple as send-
ing an HTTP request to the base URL.
DON’T INCLUDE POD SHUTDOWN LOGIC INTO YOUR READINESS PROBES
The other thing I need to mention applies to the other end of the pod’s life (pod
shutdown) and is also related to clients experiencing connection errors.
When a pod is being shut down, the app running in it usually stops accepting con-
nections as soon as it receives the termination signal. Because of this, you might think
you need to make your readiness probe start failing as soon as the shutdown proce-
dure is initiated, ensuring the pod is removed from all services it’s part of. But that’s
not necessary, because Kubernetes removes the pod from all services as soon as you
delete the pod.
apiVersion: v1
kind: Service
metadata:
  name: kubia-headless
spec:
  clusterIP: None            # This makes the service headless.
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: kubia
After you create the service with kubectl create, you can inspect it with kubectl get
and kubectl describe. You’ll see it has no cluster IP and its endpoints include (part of)
the pods matching its pod selector. I say “part of” because your pods contain a readi-
ness probe, so only pods that are ready will be listed as endpoints of the service.
Before continuing, please make sure at least two pods report being ready, by creating
the /var/ready file, as in the previous example:
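As before, you can do that with kubectl exec <pod-name> -- touch /var/ready for each of the pods. You’ll also need a pod from which to run DNS lookups. A command along these lines creates one (the image name is an assumption; any image containing the nslookup binary will do):
$ kubectl run dnsutils --image=tutum/dnsutils --generator=run-pod/v1 --command -- sleep infinity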
The trick is in the --generator=run-pod/v1 option, which tells kubectl to create the
pod directly, without any kind of ReplicationController or similar behind it.
UNDERSTANDING DNS A RECORDS RETURNED FOR A HEADLESS SERVICE
Let’s use the newly created pod to perform a DNS lookup:
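The lookup command mirrors the one used for the regular kubia service a bit further on:
$ kubectl exec dnsutils nslookup kubia-headless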
The DNS server returns two different IPs for the kubia-headless.default.svc
.cluster.local FQDN. Those are the IPs of the two pods that are reporting being
ready. You can confirm this by listing pods with kubectl get pods -o wide, which
shows the pods’ IPs.
This is different from what DNS returns for regular (non-headless) services, such
as for your kubia service, where the returned IP is the service’s cluster IP:
$ kubectl exec dnsutils nslookup kubia
...
Name: kubia.default.svc.cluster.local
Address: 10.111.249.153
Although headless services may seem different from regular services, they aren’t that
different from the clients’ perspective. Even with a headless service, clients can con-
nect to its pods by connecting to the service’s DNS name, as they can with regular ser-
vices. But with headless services, because DNS returns the pods’ IPs, clients connect
directly to the pods, instead of through the service proxy.
NOTE A headless service still provides load balancing across pods, but through the DNS round-robin mechanism instead of through the service proxy.
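If you want a headless service to expose all pods, ready or not, you can annotate the service. A minimal sketch (the service name is illustrative; the annotation is the one discussed in the warning that follows):
kind: Service
metadata:
  name: kubia-headless-all
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"   # include unready pods in DNS
spec:
  clusterIP: None
  ...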
WARNING As the annotation name suggests, as I’m writing this, this is an alpha
feature. The Kubernetes Service API already supports a new service spec field
called publishNotReadyAddresses, which will replace the tolerate-unready-
endpoints annotation. In Kubernetes version 1.9.0, the field is not honored yet
(the annotation is what determines whether unready endpoints are included in
the DNS or not). Check the documentation to see whether that’s changed.
If you’re unable to access your pods through a service, start with the following checklist.
First, make sure you’re connecting to the service’s cluster IP from within the cluster, not from the outside.
Don’t bother pinging the service IP to figure out if the service is accessible
(remember, the service’s cluster IP is a virtual IP and pinging it will never work).
If you’ve defined a readiness probe, make sure it’s succeeding; otherwise the
pod won’t be part of the service.
To confirm that a pod is part of the service, examine the corresponding End-
points object with kubectl get endpoints.
If you’re trying to access the service through its FQDN or a part of it (for exam-
ple, myservice.mynamespace.svc.cluster.local or myservice.mynamespace) and
it doesn’t work, see if you can access it using its cluster IP instead of the FQDN.
Check whether you’re connecting to the port exposed by the service and not
the target port.
Try connecting to the pod IP directly to confirm your pod is accepting connec-
tions on the correct port.
If you can’t even access your app through the pod’s IP, make sure your app isn’t
only binding to localhost.
This should help you resolve most of your service-related problems. You’ll learn much
more about how services work in chapter 11. By understanding exactly how they’re
implemented, it should be much easier for you to troubleshoot them.
5.8 Summary
In this chapter, you’ve learned how to create Kubernetes Service resources to expose
the services available in your application, regardless of how many pod instances are
providing each service. You’ve learned how Kubernetes
Exposes multiple pods that match a certain label selector under a single, stable
IP address and port
Makes services accessible from inside the cluster by default, but allows you to
make the service accessible from outside the cluster by setting its type to either
NodePort or LoadBalancer
Enables pods to discover services together with their IP addresses and ports by
looking up environment variables
Allows discovery of and communication with services residing outside the
cluster by creating a Service resource without specifying a selector, by creating
an associated Endpoints resource instead
Provides a DNS CNAME alias for external services with the ExternalName ser-
vice type
Exposes multiple HTTP services through a single Ingress (consuming a sin-
gle IP)
Along with getting a better understanding of services, you’ve also learned how to
Troubleshoot them
Modify firewall rules in Google Kubernetes/Compute Engine
Execute commands in pod containers through kubectl exec
Run a bash shell in an existing pod’s container
Modify Kubernetes resources through the kubectl apply command
Run an unmanaged ad hoc pod with kubectl run --generator=run-pod/v1
Volumes: attaching
disk storage to containers
In the previous three chapters, we introduced pods and other Kubernetes resources
that interact with them, namely ReplicationControllers, ReplicaSets, DaemonSets,
Jobs, and Services. Now, we’re going back inside the pod to learn how its containers
can access external disk storage and/or share storage between them.
We’ve said that pods are similar to logical hosts where processes running inside
them share resources such as CPU, RAM, network interfaces, and others. One
would expect the processes to also share disks, but that’s not the case. You’ll remem-
ber that each container in a pod has its own isolated filesystem, because the file-
system comes from the container’s image.
Every new container starts off with the exact set of files that was added to the image
at build time. Combine this with the fact that containers in a pod get restarted (either
because the process died or because the liveness probe signaled to Kubernetes that
the container wasn’t healthy anymore) and you’ll realize that the new container will
not see anything that was written to the filesystem by the previous container, even
though the newly started container runs in the same pod.
In certain scenarios you want the new container to continue where the last one fin-
ished, such as when restarting a process on a physical machine. You may not need (or
want) the whole filesystem to be persisted, but you do want to preserve the directories
that hold actual data.
Kubernetes provides this by defining storage volumes. They aren’t top-level resources
like pods, but are instead defined as a part of a pod and share the same lifecycle as the
pod. This means a volume is created when the pod is started and is destroyed when
the pod is deleted. Because of this, a volume’s contents will persist across container
restarts. After a container is restarted, the new container can see all the files that were
written to the volume by the previous container. Also, if a pod contains multiple con-
tainers, the volume can be used by all of them at once.
[Figure: three containers in the same pod, a WebServer reading from /var/htdocs and writing logs to /var/logs, a ContentAgent writing to /var/html, and a LogRotator reading from /var/logs, each with its own isolated filesystem.]
[Figure: the same three containers sharing two volumes: publicHtml (mounted at /var/htdocs in the WebServer and at /var/html in the ContentAgent) and logVol (mounted at /var/logs in both the WebServer and the LogRotator).]
Sharing volumes this way makes the pod more than the sum of its parts. Linux allows you to mount a filesystem at arbitrary locations
in the file tree. When you do that, the contents of the mounted filesystem are accessi-
ble in the directory it’s mounted into. By mounting the same volume into two contain-
ers, they can operate on the same files. In your case, you’re mounting two volumes in
three containers. By doing this, your three containers can work together and do some-
thing useful. Let me explain how.
First, the pod has a volume called publicHtml. This volume is mounted in the Web-
Server container at /var/htdocs, because that’s the directory the web server serves
files from. The same volume is also mounted in the ContentAgent container, but at
/var/html, because that’s where the agent writes the files to. By mounting this single vol-
ume like that, the web server will now serve the content generated by the content agent.
Similarly, the pod also has a volume called logVol for storing logs. This volume is
mounted at /var/logs in both the WebServer and the LogRotator containers. Note
that it isn’t mounted in the ContentAgent container. The container cannot access its
files, even though the container and the volume are part of the same pod. It’s not
enough to define a volume in the pod; you need to define a VolumeMount inside the
container’s spec also, if you want the container to be able to access it.
The two volumes in this example can both initially be empty, so you can use a type
of volume called emptyDir. Kubernetes also supports other types of volumes that are
either populated during initialization of the volume from an external source, or an
existing directory is mounted inside the volume. This process of populating or mount-
ing a volume is performed before the pod’s containers are started.
A volume is bound to the lifecycle of a pod and will stay in existence only while the
pod exists, but depending on the volume type, the volume’s files may remain intact
even after the pod and volume disappear, and can later be mounted into a new vol-
ume. Let’s see what types of volumes exist.
You can either create the fortune image yourself or use the one I’ve already built and pushed to Docker Hub under luksa/fortune. If you want a refresher on how to build Docker images, refer to the sidebar.
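The fortuneloop.sh script referenced below is a short bash loop that writes a new fortune to /var/htdocs/index.html every 10 seconds. It looks roughly like this (create it in a new directory and make it executable with chmod +x fortuneloop.sh):
#!/bin/bash
trap "exit" SIGINT
mkdir -p /var/htdocs
while :
do
  echo $(date) Writing fortune to /var/htdocs/index.html
  /usr/games/fortune > /var/htdocs/index.html
  sleep 10
done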
Then, in the same directory, create a file called Dockerfile containing the following:
FROM ubuntu:latest
RUN apt-get update ; apt-get -y install fortune
ADD fortuneloop.sh /bin/fortuneloop.sh
ENTRYPOINT /bin/fortuneloop.sh
The image is based on the ubuntu:latest image, which doesn’t include the fortune
binary by default. That’s why in the second line of the Dockerfile you install it with
apt-get. After that, you add the fortuneloop.sh script to the image’s /bin folder.
In the last line of the Dockerfile, you specify that the fortuneloop.sh script should
be executed when the image is run.
After preparing both files, build and upload the image to Docker Hub with the following
two commands (replace luksa with your own Docker Hub user ID):
$ docker build -t luksa/fortune .
$ docker push luksa/fortune
Listing 6.1 A pod with two containers sharing the same volume: fortune-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: fortune
spec:
  containers:
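  # The rest of the listing, roughly: two containers sharing a single emptyDir
  # volume, as described in the text that follows.
  - image: luksa/fortune
    name: html-generator
    volumeMounts:
    - name: html
      mountPath: /var/htdocs
  - image: nginx:alpine
    name: web-server
    volumeMounts:
    - name: html
      mountPath: /usr/share/nginx/html
      readOnly: true
    ports:
    - containerPort: 80
      protocol: TCP
  volumes:
  - name: html
    emptyDir: {}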
The pod contains two containers and a single volume that’s mounted in both of
them, yet at different paths. When the html-generator container starts, it starts writ-
ing the output of the fortune command to the /var/htdocs/index.html file every 10
seconds. Because the volume is mounted at /var/htdocs, the index.html file is writ-
ten to the volume instead of the container’s top layer. As soon as the web-server con-
tainer starts, it starts serving whatever HTML files are in the /usr/share/nginx/html
directory (this is the default directory Nginx serves files from). Because you mounted
the volume in that exact location, Nginx will serve the index.html file written there
by the container running the fortune loop. The end effect is that a client sending an
HTTP request to the pod on port 80 will receive the current fortune message as
the response.
SEEING THE POD IN ACTION
To see the fortune message, you need to enable access to the pod. You’ll do that by
forwarding a port from your local machine to the pod:
$ kubectl port-forward fortune 8080:80
Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
NOTE As an exercise, you can also expose the pod through a service instead
of using port forwarding.
Now you can access the Nginx server through port 8080 of your local machine. Use
curl to do that:
$ curl http://localhost:8080
Beware of a tall blond man with one black shoe.
If you wait a few seconds and send another request, you should receive a different
message. By combining two containers, you created a simple app to see how a volume
can glue together two containers and enhance what each of them does.
An emptyDir volume is the simplest type of volume, but other types build upon it.
After the empty directory is created, they populate it with data. One such volume type
is the gitRepo volume type, which we’ll introduce next.
Figure 6.3 A gitRepo volume is an emptyDir volume initially populated with the contents of a Git repository. (2. Kubernetes creates an empty directory and clones the specified Git repository into it; 3. the pod’s container is started with the volume mounted at the mount path.)
NOTE After the gitRepo volume is created, it isn’t kept in sync with the repo
it’s referencing. The files in the volume will not be updated when you push
additional commits to the Git repository. However, if your pod is managed by
a ReplicationController, deleting the pod will result in a new pod being cre-
ated and this new pod’s volume will then contain the latest commits.
For example, you can use a Git repository to store static HTML files of your website
and create a pod containing a web server container and a gitRepo volume. Every time
the pod is created, it pulls the latest version of your website and starts serving it. The
only drawback to this is that you need to delete the pod every time you push changes
to the gitRepo and want to start serving the new version of the website.
Let’s do this right now. It’s not that different from what you did before.
RUNNING A WEB SERVER POD SERVING FILES FROM A CLONED GIT REPOSITORY
Before you create your pod, you’ll need an actual Git repository with HTML files in it.
I’ve created a repo on GitHub at https://github.com/luksa/kubia-website-example.git.
You’ll need to fork it (create your own copy of the repo on GitHub) so you can push
changes to it later.
Once you’ve created your fork, you can move on to creating the pod. This time,
you’ll only need a single Nginx container and a single gitRepo volume in the pod (be
sure to point the gitRepo volume to your own fork of my repository), as shown in the
following listing.
apiVersion: v1
kind: Pod
metadata:
  name: gitrepo-volume-pod
spec:
  containers:
  - image: nginx:alpine
    name: web-server
    volumeMounts:
    - name: html
      mountPath: /usr/share/nginx/html
      readOnly: true
    ports:
    - containerPort: 80
      protocol: TCP
  volumes:
  - name: html
    gitRepo:                   # You’re creating a gitRepo volume.
      repository: https://github.com/luksa/kubia-website-example.git   # The volume will clone this Git repository.
      revision: master         # The master branch will be checked out.
      directory: .             # You want the repo to be cloned into the root dir of the volume.
When you create the pod, the volume is first initialized as an empty directory and then
the specified Git repository is cloned into it. If you hadn’t set the directory to . (dot),
the repository would have been cloned into the kubia-website-example subdirectory,
which isn’t what you want. You want the repo to be cloned into the root directory of
your volume. Along with the repository, you also specified you want Kubernetes to
check out whatever revision the master branch is pointing to at the time the volume
is created.
With the pod running, you can try hitting it through port forwarding, a service, or by
executing the curl command from within the pod (or any other pod inside the cluster).
CONFIRMING THE FILES AREN’T KEPT IN SYNC WITH THE GIT REPO
Now you’ll make changes to the index.html file in your GitHub repository. If you
don’t use Git locally, you can edit the file on GitHub directly—click on the file in your
GitHub repository to open it and then click on the pencil icon to start editing it.
Change the text and then commit the changes by clicking the button at the bottom.
The master branch of the Git repository now includes the changes you made to the
HTML file. These changes will not be visible on your Nginx web server yet, because
the gitRepo volume isn’t kept in sync with the Git repository. You can confirm this by
hitting the pod again.
To see the new version of the website, you need to delete the pod and create
it again. Instead of having to delete the pod every time you make changes, you could
run an additional process, which keeps your volume in sync with the Git repository.
I won’t explain in detail how to do this. Instead, try doing this yourself as an exer-
cise, but here are a few pointers.
INTRODUCING SIDECAR CONTAINERS
The Git sync process shouldn’t run in the same container as the Nginx web server, but
in a second container: a sidecar container. A sidecar container is a container that aug-
ments the operation of the main container of the pod. You add a sidecar to a pod so
you can use an existing container image instead of cramming additional logic into the
main app’s code, which would make it overly complex and less reusable.
To find an existing container image, which keeps a local directory synchronized
with a Git repository, go to Docker Hub and search for “git sync.” You’ll find many
images that do that. Then use the image in a new container in the pod from the previ-
ous example, mount the pod’s existing gitRepo volume in the new container, and
configure the Git sync container to keep the files in sync with your Git repo. If you set
everything up correctly, you should see that the files the web server is serving are kept
in sync with your GitHub repo.
NOTE An example in chapter 18 includes using a Git sync container like the
one explained here, so you can wait until you reach chapter 18 and follow the
step-by-step instructions then instead of doing this exercise on your own now.
Figure 6.4 A hostPath volume mounts a file or directory on the worker node (for example, /some/path/on/host on each node) into the container’s filesystem.
hostPath volumes are the first type of persistent storage we’re introducing, because
both the gitRepo and emptyDir volumes’ contents get deleted when a pod is torn
down, whereas a hostPath volume’s contents don’t. If a pod is deleted and the next
pod uses a hostPath volume pointing to the same path on the host, the new pod will
see whatever was left behind by the previous pod, but only if it’s scheduled to the same
node as the first pod.
Pick the first one and see what kinds of volumes it uses (shown in the following listing).
Listing 6.3 A pod using hostPath volumes to access the node’s logs
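The relevant part of such a pod’s spec looks roughly like this (a sketch; the volume names are illustrative):
  volumes:
  - name: varlog
    hostPath:
      path: /var/log                       # a directory on the node’s filesystem
  - name: varlibdockercontainers
    hostPath:
      path: /var/lib/docker/containers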
Aha! The pod uses two hostPath volumes to gain access to the node’s /var/log and
the /var/lib/docker/containers directories. You’d think you were lucky to find a pod
using a hostPath volume on the first try, but not really (at least not on GKE). Check
the other pods, and you’ll see most use this type of volume either to access the node’s
log files, kubeconfig (the Kubernetes config file), or the CA certificates.
If you inspect the other pods, you’ll see none of them uses the hostPath volume
for storing their own data. They all use it to get access to the node’s data. But as we’ll
see later in the chapter, hostPath volumes are often used for trying out persistent stor-
age in single-node clusters, such as the one created by Minikube. Read on to learn
about the types of volumes you should use for storing persistent data properly even in
a multi-node cluster.
TIP Remember to use hostPath volumes only if you need to read or write sys-
tem files on the node. Never use them to persist data across pods.
Listing your cluster (for example, with gcloud container clusters list) shows it was created in zone europe-west1-b, so you need to create the GCE persistent disk in the same zone as well. You create the disk like this:
$ gcloud compute disks create --size=1GiB --zone=europe-west1-b mongodb
WARNING: You have selected a disk size of under [200GB]. This may result in
poor I/O performance. For more information, see:
https://developers.google.com/compute/docs/disks#pdperformance.
Created [https://www.googleapis.com/compute/v1/projects/rapid-pivot-
136513/zones/europe-west1-b/disks/mongodb].
NAME ZONE SIZE_GB TYPE STATUS
mongodb europe-west1-b 1 pd-standard READY
This command creates a 1 GiB large GCE persistent disk called mongodb. You can
ignore the warning about the disk size, because you don’t care about the disk’s perfor-
mance for the tests you’re about to run.
CREATING A POD USING A GCEPERSISTENTDISK VOLUME
Now that you have your physical storage properly set up, you can use it in a volume
inside your MongoDB pod. You’re going to prepare the YAML for the pod, which is
shown in the following listing.
apiVersion: v1
kind: Pod
metadata:
  name: mongodb
spec:
  volumes:
  - name: mongodb-data         # The name of the volume (also referenced when mounting the volume)
    gcePersistentDisk:         # The type of the volume is a GCE Persistent Disk.
      pdName: mongodb          # The name of the persistent disk must match the actual PD you created earlier.
      fsType: ext4             # The filesystem type is EXT4 (a type of Linux filesystem).
  containers:
  - image: mongo
    name: mongodb
    volumeMounts:
    - name: mongodb-data
      mountPath: /data/db      # The path where MongoDB stores its data
    ports:
    - containerPort: 27017
      protocol: TCP
NOTE If you’re using Minikube, you can’t use a GCE Persistent Disk, but you
can deploy mongodb-pod-hostpath.yaml, which uses a hostPath volume
instead of a GCE PD.
The pod contains a single container and a single volume backed by the GCE Per-
sistent Disk you’ve created (as shown in figure 6.5). You’re mounting the volume
inside the container at /data/db, because that’s where MongoDB stores its data.
Figure 6.5 A pod with a single container running MongoDB, which mounts a volume referencing an external GCE Persistent Disk
WRITING DATA TO THE PERSISTENT STORAGE BY ADDING DOCUMENTS TO YOUR MONGODB DATABASE
Now that you’ve created the pod and the container has been started, you can run the
MongoDB shell inside the container and use it to write some data to the data store.
You’ll run the shell as shown in the following listing.
Listing 6.5 Entering the MongoDB shell inside the mongodb pod
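In essence, it’s this command (the MongoDB shell prompt then appears):
$ kubectl exec -it mongodb mongo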
MongoDB allows storing JSON documents, so you’ll store one to see if it’s stored per-
sistently and can be retrieved after the pod is re-created. Insert a new JSON document
with the following commands:
> use mystore
switched to db mystore
> db.foo.insert({name:'foo'})
WriteResult({ "nInserted" : 1 })
You’ve inserted a simple JSON document with a single property (name: ’foo’). Now,
use the find() command to see the document you inserted:
> db.foo.find()
{ "_id" : ObjectId("57a61eb9de0cfd512374cc75"), "name" : "foo" }
There it is. The document should be stored in your GCE persistent disk now.
RE-CREATING THE POD AND VERIFYING THAT IT CAN READ THE DATA PERSISTED BY THE PREVIOUS POD
You can now exit the mongodb shell (type exit and press Enter), and then delete the
pod and recreate it:
$ kubectl delete pod mongodb
pod "mongodb" deleted
$ kubectl create -f mongodb-pod-gcepd.yaml
pod "mongodb" created
The new pod uses the exact same GCE persistent disk as the previous pod, so the
MongoDB container running inside it should see the exact same data, even if the pod
is scheduled to a different node.
TIP You can see what node a pod is scheduled to by running kubectl get po
-o wide.
Once the container is up, you can again run the MongoDB shell and check to see if the
document you stored earlier can still be retrieved, as shown in the following listing.
As expected, the data is still there, even though you deleted the pod and re-created it.
This confirms you can use a GCE persistent disk to persist data across multiple pod
instances.
You’re done playing with the MongoDB pod, so go ahead and delete it again, but
hold off on deleting the underlying GCE persistent disk. You’ll use it again later in
the chapter.
apiVersion: v1
kind: Pod
metadata:
  name: mongodb
spec:
  volumes:
  - name: mongodb-data
    awsElasticBlockStore:      # Using awsElasticBlockStore instead of gcePersistentDisk
      volumeID: my-volume      # Specify the ID of the EBS volume you created.
      fsType: ext4             # The filesystem type is EXT4 as before.
  containers:
  - ...
  volumes:
  - name: mongodb-data
    nfs:                       # This volume is backed by an NFS share.
      server: 1.2.3.4          # The IP of the NFS server
      path: /some/path         # The path exported by the server
Figure 6.6 PersistentVolumes are provisioned by cluster admins and consumed by pods through PersistentVolumeClaims. (2. The admin creates a PersistentVolume (PV) by posting a PV descriptor to the Kubernetes API; 3. the user creates a PersistentVolumeClaim (PVC); 4. Kubernetes finds a PV of adequate size and access mode and binds the PVC to the PV; 5. the user creates a pod with a volume referencing the PVC.)
Instead of the developer adding a technology-specific volume to their pod, it’s the
cluster administrator who sets up the underlying storage and then registers it in
Kubernetes by creating a PersistentVolume resource through the Kubernetes API
server. When creating the PersistentVolume, the admin specifies its size and the access
modes it supports.
When a cluster user needs to use persistent storage in one of their pods, they first
create a PersistentVolumeClaim manifest, specifying the minimum size and the access
mode they require. The user then submits the PersistentVolumeClaim manifest to the
Kubernetes API server, and Kubernetes finds the appropriate PersistentVolume and
binds the volume to the claim.
The PersistentVolumeClaim can then be used as one of the volumes inside a pod.
Other users cannot use the same PersistentVolume until it has been released by delet-
ing the bound PersistentVolumeClaim.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mongodb-pv
spec:
  capacity:
    storage: 1Gi               # Defining the PersistentVolume’s size
  accessModes:                 # It can either be mounted by a single client for reading and writing
  - ReadWriteOnce              # or by multiple clients for reading only.
  - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain    # After the claim is released, the PersistentVolume should be retained (not erased or deleted).
  gcePersistentDisk:           # The PersistentVolume is backed by the GCE Persistent Disk you created earlier.
    pdName: mongodb
    fsType: ext4
When creating a PersistentVolume, the administrator needs to tell Kubernetes what its
capacity is and whether it can be read from and/or written to by a single node or by
multiple nodes at the same time. They also need to tell Kubernetes what to do with the
PersistentVolume when it’s released (when the PersistentVolumeClaim it’s bound to is
deleted). And last, but certainly not least, they need to specify the type, location, and
other properties of the actual storage this PersistentVolume is backed by. If you look
closely, this last part is exactly the same as earlier, when you referenced the GCE Per-
sistent Disk in the pod volume directly (shown again in the following listing).
spec:
  volumes:
  - name: mongodb-data
    gcePersistentDisk:
      pdName: mongodb
      fsType: ext4
...
After you create the PersistentVolume with the kubectl create command, it should
be ready to be claimed. See if it is by listing all PersistentVolumes:
$ kubectl get pv
NAME CAPACITY RECLAIMPOLICY ACCESSMODES STATUS CLAIM
mongodb-pv 1Gi Retain RWO,ROX Available
As expected, the PersistentVolume is shown as Available, because you haven’t yet cre-
ated the PersistentVolumeClaim.
Figure 6.7 PersistentVolumes, like cluster Nodes, don’t belong to any namespace, unlike pods and PersistentVolumeClaims.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongodb-pvc            # The name of your claim—you’ll need this later when using the claim as the pod’s volume.
spec:
  resources:
    requests:
      storage: 1Gi             # Requesting 1 GiB of storage
  accessModes:
  - ReadWriteOnce              # You want the storage to support a single client (performing both reads and writes).
  storageClassName: ""         # You’ll learn about this in the section about dynamic provisioning.
As soon as you create the claim, Kubernetes finds the appropriate PersistentVolume
and binds it to the claim. The PersistentVolume’s capacity must be large enough to
accommodate what the claim requests. Additionally, the volume’s access modes must
include the access modes requested by the claim. In your case, the claim requests 1 GiB
of storage and a ReadWriteOnce access mode. The PersistentVolume you created ear-
lier matches those two requirements so it is bound to your claim. You can see this by
inspecting the claim.
LISTING PERSISTENTVOLUMECLAIMS
List all PersistentVolumeClaims to see the state of your PVC:
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESSMODES AGE
mongodb-pvc Bound mongodb-pv 1Gi RWO,ROX 3s
NOTE RWO, ROX, and RWX pertain to the number of worker nodes that can use
the volume at the same time, not to the number of pods!
LISTING PERSISTENTVOLUMES
You can also see that the PersistentVolume is now Bound and no longer Available by
inspecting it with kubectl get:
$ kubectl get pv
NAME CAPACITY ACCESSMODES STATUS CLAIM AGE
mongodb-pv 1Gi RWO,ROX Bound default/mongodb-pvc 1m
apiVersion: v1
kind: Pod
metadata:
  name: mongodb
spec:
  containers:
  - image: mongo
    name: mongodb
    volumeMounts:
    - name: mongodb-data
      mountPath: /data/db
    ports:
    - containerPort: 27017
      protocol: TCP
  volumes:
  - name: mongodb-data
    persistentVolumeClaim:
      claimName: mongodb-pvc   # Referencing the PersistentVolumeClaim by name in the pod volume
Go ahead and create the pod. Now, check to see if the pod is indeed using the same
PersistentVolume and its underlying GCE PD. You should see the data you stored ear-
lier by running the MongoDB shell again, as shown in the following listing.
Listing 6.13 Retrieving MongoDB’s persisted data in the pod using the PVC and PV
And there it is. You’re able to retrieve the document you stored into MongoDB previously.
Figure 6.8 Using the GCE Persistent Disk directly or through a PVC and PV (top: the pod’s volume references the GCE Persistent Disk directly; bottom: the pod’s volume references the mongodb-pvc claim, which is bound to the mongodb-pv PersistentVolume)
Consider how using this indirect method of obtaining storage from the infrastructure
is much simpler for the application developer (or cluster user). Yes, it does require
the additional steps of creating the PersistentVolume and the PersistentVolumeClaim,
but the developer doesn’t have to know anything about the actual storage technology
used underneath.
Additionally, the same pod and claim manifests can now be used on many different
Kubernetes clusters, because they don’t refer to anything infrastructure-specific. The
claim states, “I need x amount of storage and I need to be able to read and write to it
by a single client at once,” and then the pod references the claim by name in one of
its volumes.
What if you create the PersistentVolumeClaim again? Will it be bound to the Persistent-
Volume or not? After you create the claim, what does kubectl get pvc show?
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESSMODES AGE
mongodb-pvc Pending 13s
The claim’s status is shown as Pending. Interesting. When you created the claim ear-
lier, it was immediately bound to the PersistentVolume, so why wasn’t it bound now?
Maybe listing the PersistentVolumes can shed more light on this:
$ kubectl get pv
NAME CAPACITY ACCESSMODES STATUS CLAIM REASON AGE
mongodb-pv 1Gi RWO,ROX Released default/mongodb-pvc 5m
The STATUS column shows the PersistentVolume as Released, not Available like
before. Because you’ve already used the volume, it may contain data and shouldn’t be
bound to a completely new claim without giving the cluster admin a chance to clean it
up. Without this, a new pod using the same PersistentVolume could read the data
stored there by the previous pod, even if the claim and pod were created in a different
namespace (and thus likely belong to a different cluster tenant).
RECLAIMING PERSISTENTVOLUMES MANUALLY
You told Kubernetes you wanted your PersistentVolume to behave like this when you
created it—by setting its persistentVolumeReclaimPolicy to Retain. You wanted
Kubernetes to retain the volume and its contents after it’s released from its claim. As
far as I’m aware, the only way to manually recycle the PersistentVolume to make it
available again is to delete and recreate the PersistentVolume resource. As you do
that, it’s your decision what to do with the files on the underlying storage: you can
either delete them or leave them alone so they can be reused by the next pod.
RECLAIMING PERSISTENTVOLUMES AUTOMATICALLY
Two other possible reclaim policies exist: Recycle and Delete. The first one deletes
the volume’s contents and makes the volume available to be claimed again. This way,
the PersistentVolume can be reused multiple times by different PersistentVolume-
Claims and different pods, as you can see in figure 6.9.
The Delete policy, on the other hand, deletes the underlying storage. Note that the Recycle option is currently not available for GCE Persistent Disks; this type of PersistentVolume supports only the Retain and Delete policies.
Figure 6.9 The lifespan of a PersistentVolume, PersistentVolumeClaims, and the pods using them (a single PersistentVolume can be claimed by PersistentVolumeClaim 1, released, and later claimed by PersistentVolumeClaim 2)
Kubernetes includes provisioners for the most popular cloud providers, so the admin-
istrator doesn’t always need to deploy a provisioner. But if Kubernetes is deployed
on-premises, a custom provisioner needs to be deployed.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/gce-pd    # The volume plugin to use for provisioning the PersistentVolume
parameters:                          # The parameters passed to the provisioner
  type: pd-ssd
  zone: europe-west1-b
The StorageClass resource specifies which provisioner should be used for provision-
ing the PersistentVolume when a PersistentVolumeClaim requests this StorageClass.
The parameters defined in the StorageClass definition are passed to the provisioner
and are specific to each provisioner plugin.
The StorageClass uses the Google Compute Engine (GCE) Persistent Disk (PD)
provisioner, which means it can be used when Kubernetes is running in GCE. For
other cloud providers, other provisioners need to be used.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongodb-pvc
spec:
  storageClassName: fast       # This PVC requests the custom storage class.
  resources:
    requests:
      storage: 100Mi
  accessModes:
  - ReadWriteOnce
Apart from specifying the size and access modes, your PersistentVolumeClaim now
also specifies the class of storage you want to use. When you create the claim, the
PersistentVolume is created by the provisioner referenced in the fast StorageClass
resource. The provisioner is used even if an existing manually provisioned Persistent-
Volume matches the PersistentVolumeClaim.
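After you create the claim, listing it should show it bound to a dynamically provisioned volume, along these lines (the volume name is generated, so yours will differ):
$ kubectl get pvc mongodb-pvc
NAME          STATUS   VOLUME         CAPACITY   ACCESSMODES   STORAGECLASS   AGE
mongodb-pvc   Bound    pvc-1e6bc048   1Gi        RWO           fast           1m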
The VOLUME column shows the PersistentVolume that’s bound to this claim (the actual
name is longer than what’s shown above). You can try listing PersistentVolumes now to
see that a new PV has indeed been created automatically:
$ kubectl get pv
NAME CAPACITY ACCESSMODES RECLAIMPOLICY STATUS STORAGECLASS
mongodb-pv 1Gi RWO,ROX Retain Released
pvc-1e6bc048 1Gi RWO Delete Bound fast
You can see the dynamically provisioned PersistentVolume. Its capacity and access
modes are what you requested in the PVC. Its reclaim policy is Delete, which means
the PersistentVolume will be deleted when the PVC is deleted. Beside the PV, the pro-
visioner also provisioned the actual storage. Your fast StorageClass is configured to
use the kubernetes.io/gce-pd provisioner, which provisions GCE Persistent Disks.
You can see the disk with the following command:
$ gcloud compute disks list
NAME ZONE SIZE_GB TYPE STATUS
gke-kubia-dyn-pvc-1e6bc048 europe-west1-d 1 pd-ssd READY
gke-kubia-default-pool-71df europe-west1-d 100 pd-standard READY
gke-kubia-default-pool-79cd europe-west1-d 100 pd-standard READY
gke-kubia-default-pool-blc4 europe-west1-d 100 pd-standard READY
mongodb europe-west1-d 1 pd-standard READY
As you can see, the first persistent disk’s name suggests it was provisioned dynamically
and its type shows it’s an SSD, as specified in the storage class you created earlier.
UNDERSTANDING HOW TO USE STORAGE CLASSES
The cluster admin can create multiple storage classes with different performance or
other characteristics. The developer then decides which one is most appropriate for
each claim they create.
The nice thing about StorageClasses is the fact that claims refer to them by
name. The PVC definitions are therefore portable across different clusters, as long
as the StorageClass names are the same across all of them. To see this portability
yourself, you can try running the same example on Minikube, if you’ve been using
GKE up to this point. As a cluster admin, you’ll have to create a different storage
class (but with the same name). The storage class defined in the storageclass-fast-
hostpath.yaml file is tailor-made for use in Minikube. Then, once you deploy the stor-
age class, you as a cluster user can deploy the exact same PVC manifest and the exact
same pod manifest as before. This shows how the pods and PVCs are portable across
different clusters.
Beside the fast storage class, which you created yourself, a standard storage class
exists and is marked as default. You’ll learn what that means in a moment. Let’s list the
storage classes available in Minikube, so we can compare:
$ kubectl get sc
NAME TYPE
fast k8s.io/minikube-hostpath
standard (default) k8s.io/minikube-hostpath
Again, the fast storage class was created by you and a default standard storage class
exists here as well. Comparing the TYPE columns in the two listings, you see that GKE uses the kubernetes.io/gce-pd provisioner, whereas Minikube uses k8s.io/minikube-hostpath.
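You can examine the default storage class’s full definition with kubectl get sc standard -o yaml. On Minikube it looks roughly like this (treat it as a sketch; the exact annotation name has changed between Kubernetes versions):
$ kubectl get sc standard -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
  name: standard
provisioner: k8s.io/minikube-hostpath
...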
If you look closely toward the top of the listing, the storage class definition includes an
annotation, which makes this the default storage class. The default storage class is
what’s used to dynamically provision a PersistentVolume if the PersistentVolumeClaim
doesn’t explicitly say which storage class to use.
CREATING A PERSISTENTVOLUMECLAIM WITHOUT SPECIFYING A STORAGE CLASS
You can create a PVC without specifying the storageClassName attribute and (on
Google Kubernetes Engine) a GCE Persistent Disk of type pd-standard will be provi-
sioned for you. Try this by creating a claim from the YAML in the following listing.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongodb-pvc2
spec:
  resources:
    requests:
      storage: 100Mi
  accessModes:
  - ReadWriteOnce
  # You’re not specifying the storageClassName attribute (unlike in earlier examples).
This PVC definition includes only the storage size request and the desired access
modes, but no storage class. When you create the PVC, whatever storage class is
marked as default will be used. You can confirm that’s the case:
If you hadn’t set the storageClassName attribute to an empty string, the dynamic vol-
ume provisioner would have provisioned a new PersistentVolume, despite there being
an appropriate pre-provisioned PersistentVolume. At that point, I wanted to demon-
strate how a claim gets bound to a manually pre-provisioned PersistentVolume. I didn’t
want the dynamic provisioner to interfere.
TIP Explicitly set storageClassName to "" if you want the PVC to use a pre-
provisioned PersistentVolume.
[Figure: the complete picture of dynamic provisioning. 2. The admin creates one or more StorageClasses and marks one as the default (it may already exist); 4. Kubernetes looks up the StorageClass and the provisioner referenced in it and asks the provisioner to provision a new PV based on the PVC’s requested access mode and storage size and the parameters in the StorageClass; 5. the provisioner provisions the actual storage, creates a PersistentVolume, and binds it to the PVC; 6. the user creates a pod with a volume referencing the PVC by name.]
6.7 Summary
This chapter has shown you how volumes are used to provide either temporary or per-
sistent storage to a pod’s containers. You’ve learned how to
Create a multi-container pod and have the pod’s containers operate on the
same files by adding a volume to the pod and mounting it in each container
Use the emptyDir volume to store temporary, non-persistent data
Use the gitRepo volume to easily populate a directory with the contents of a Git
repository at pod startup
Use the hostPath volume to access files from the host node
Mount external storage in a volume to persist pod data across pod restarts
Decouple the pod from the storage infrastructure by using PersistentVolumes
and PersistentVolumeClaims
Have PersistentVolumes of the desired (or the default) storage class dynami-
cally provisioned for each PersistentVolumeClaim
Prevent the dynamic provisioner from interfering when you want the Persistent-
VolumeClaim to be bound to a pre-provisioned PersistentVolume
In the next chapter, you’ll see what mechanisms Kubernetes provides to deliver con-
figuration data, secret information, and metadata about the pod and container to the
processes running inside a pod. This is done with the special types of volumes we’ve
mentioned in this chapter, but not yet explored.
ConfigMaps and Secrets:
configuring applications
Up to now you haven’t had to pass any kind of configuration data to the apps you’ve
run in the exercises in this book. Because almost all apps require configuration (set-
tings that differ between deployed instances, credentials for accessing external sys-
tems, and so on), which shouldn’t be baked into the built app itself, let’s see how to
pass configuration options to your app when running it in Kubernetes.
We’ll go over all these options (passing command-line arguments, setting environment variables, and mounting configuration files through a special type of volume) in the next few sections, but before we start, let’s look at config options from a security perspective.
don’t contain any sensitive information, several can. These include credentials, pri-
vate encryption keys, and similar data that needs to be kept secure. This type of infor-
mation needs to be handled with special care, which is why Kubernetes offers
another type of first-class object called a Secret. We’ll learn about it in the last part of
this chapter.
Sometimes you want to run a container with an executable other than the one specified in the image, or you want to run it with a different set of command-line arguments. We’ll look at how to do that now.
Although you can use the CMD instruction to specify the command you want to execute
when the image is run, the correct way is to do it through the ENTRYPOINT instruction
and to only specify the CMD if you want to define the default arguments. The image can
then be run without specifying any arguments
or with additional arguments, which override whatever’s set under CMD in the Dockerfile:
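For example, using placeholders for the image name and its arguments:
$ docker run <image>
$ docker run <image> <arguments>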
Both instructions support two forms: the shell form (for example, ENTRYPOINT node app.js) and the exec form (for example, ENTRYPOINT ["node", "app.js"]). The difference is whether the specified command is invoked inside a shell or not.
In the kubia image you created in chapter 2, you used the exec form of the ENTRY-
POINT instruction:
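# exec form: the executable and its arguments specified as a JSON array
ENTRYPOINT ["node", "app.js"]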
This runs the node process directly (not inside a shell), as you can see by listing the
processes running inside the container:
$ docker exec 4675d ps x
PID TTY STAT TIME COMMAND
1 ? Ssl 0:00 node app.js
12 ? Rs 0:00 ps x
If you’d used the shell form (ENTRYPOINT node app.js), these would have been the
container’s processes:
$ docker exec -it e4bad ps x
PID TTY STAT TIME COMMAND
1 ? Ss 0:00 /bin/sh -c node app.js
7 ? Sl 0:00 node app.js
As you can see, in that case, the main process (PID 1) would be the shell process
instead of the node process. The node process (PID 7) would be started from that
shell. The shell process is unnecessary, which is why you should always use the exec
form of the ENTRYPOINT instruction.
MAKING THE INTERVAL CONFIGURABLE IN YOUR FORTUNE IMAGE
Let’s modify your fortune script and image so the delay interval in the loop is configu-
rable. You’ll add an INTERVAL variable and initialize it with the value of the first com-
mand-line argument, as shown in the following listing.
Listing 7.1 Fortune script with interval configurable through argument: fortune-args/
fortuneloop.sh
#!/bin/bash
trap "exit" SIGINT
INTERVAL=$1
echo Configured to generate new fortune every $INTERVAL seconds
mkdir -p /var/htdocs
while :
do
echo $(date) Writing fortune to /var/htdocs/index.html
/usr/games/fortune > /var/htdocs/index.html
sleep $INTERVAL
done
You’ve added the INTERVAL variable, initialized it from the first command-line argument, and used it in the echo and sleep commands. Now, you’ll modify the Dockerfile so
it uses the exec version of the ENTRYPOINT instruction and sets the default interval to
10 seconds using the CMD instruction, as shown in the following listing.
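A minimal sketch of the modified Dockerfile, assuming the same ubuntu-based fortune image built earlier in the book:
FROM ubuntu:latest
RUN apt-get update ; apt-get -y install fortune
ADD fortuneloop.sh /bin/fortuneloop.sh
ENTRYPOINT ["/bin/fortuneloop.sh"]
CMD ["10"]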
You can now build and push the image to Docker Hub. This time, you’ll tag the image
as args instead of latest:
$ docker build -t docker.io/luksa/fortune:args .
$ docker push docker.io/luksa/fortune:args
And you can override the default sleep interval by passing it as an argument:
$ docker run -it docker.io/luksa/fortune:args 15
Configured to generate new fortune every 15 seconds
Now that you’re sure your image honors the argument passed to it, let’s see how to use
it in a pod.
kind: Pod
spec:
  containers:
  - image: some/image
    command: ["/bin/command"]
    args: ["arg1", "arg2", "arg3"]
In most cases, you’ll only set custom arguments and rarely override the command
(except in general-purpose images such as busybox, which doesn’t define an ENTRYPOINT at all).
NOTE The command and args fields can’t be updated after the pod is created.
The two Dockerfile instructions and the equivalent pod spec fields are shown in table 7.1.
Table 7.1 Specifying the executable and its arguments in Docker vs Kubernetes
Docker        Kubernetes    Description
ENTRYPOINT    command       The executable that’s executed inside the container
CMD           args          The arguments passed to the executable
apiVersion: v1
kind: Pod
metadata:
  name: fortune2s                  # You changed the pod's name.
spec:
  containers:
  - image: luksa/fortune:args      # Using fortune:args instead of fortune:latest
    args: ["2"]                    # This argument makes the script generate a new fortune every two seconds.
    name: html-generator
    volumeMounts:
    - name: html
      mountPath: /var/htdocs
...
You added the args array to the container definition. Try creating this pod now. The
values of the array will be passed to the container as command-line arguments when it
is run.
The array notation used in this listing is great if you have one argument or a few. If
you have several, you can also use the following notation:
args:
- foo
- bar
- "15"
TIP You don’t need to enclose string values in quotation marks (but you
must enclose numbers).
MAKING THE INTERVAL IN YOUR FORTUNE IMAGE CONFIGURABLE THROUGH AN ENVIRONMENT VARIABLE
Let’s see how to modify your fortuneloop.sh script once again to allow it to be config-
ured from an environment variable, as shown in the following listing.
Listing 7.5 Fortune script with interval configurable through env var: fortune-env/fortuneloop.sh
#!/bin/bash
trap "exit" SIGINT
echo Configured to generate new fortune every $INTERVAL seconds
mkdir -p /var/htdocs
while :
do
echo $(date) Writing fortune to /var/htdocs/index.html
/usr/games/fortune > /var/htdocs/index.html
sleep $INTERVAL
done
All you had to do was remove the line where the INTERVAL variable is initialized. Because
your “app” is a simple bash script, you didn’t need to do anything else. If the app were
written in Java, you’d use System.getenv("INTERVAL"), whereas in Node.js you’d use
process.env.INTERVAL, and in Python you’d use os.environ['INTERVAL'].
kind: Pod
spec:
  containers:
  - image: luksa/fortune:env
    env:
    - name: INTERVAL       # Adding a single variable to the environment variable list
      value: "30"
    name: html-generator
...
As mentioned previously, you set the environment variable inside the container defini-
tion, not at the pod level.
env:
- name: FIRST_VAR
  value: "foo"
- name: SECOND_VAR
  value: "$(FIRST_VAR)bar"
In this case, the SECOND_VAR’s value will be "foobar". Similarly, both the command and
args attributes you learned about in section 7.2 can also refer to environment vari-
ables like this. You’ll use this method in section 7.4.5.
Figure 7.2 Pods use ConfigMaps through environment variables and configMap volumes.
Sure, the application can also read the contents of a ConfigMap directly through the
Kubernetes REST API endpoint if needed, but unless you have a real need for this,
you should keep your app as Kubernetes-agnostic as possible.
Regardless of how an app consumes a ConfigMap, having the config in a separate
standalone object like this allows you to keep multiple manifests for ConfigMaps with
the same name, each for a different environment (development, testing, QA, produc-
tion, and so on). Because pods reference the ConfigMap by name, you can use a dif-
ferent config in each environment while using the same pod specification across all of
them (see figure 7.3).
Figure 7.3 Two different ConfigMaps with the same name used in different environments
NOTE ConfigMap keys must be a valid DNS subdomain (they may only con-
tain alphanumeric characters, dashes, underscores, and dots). They may
optionally include a leading dot.
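The ConfigMap shown in figure 7.4 can be created with a single --from-literal argument, for example:
$ kubectl create configmap fortune-config --from-literal=sleep-interval=25
configmap "fortune-config" created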
Figure 7.4 The fortune-config ConfigMap containing the single entry sleep-interval: 25
ConfigMaps usually contain more than one entry. To create a ConfigMap with multi-
ple literal entries, you add multiple --from-literal arguments:
$ kubectl create configmap myconfigmap
➥ --from-literal=foo=bar --from-literal=bar=baz --from-literal=one=two
Let’s inspect the YAML descriptor of the ConfigMap you created by using the kubectl
get command, as shown in the following listing.
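For the fortune-config ConfigMap, the output would look roughly like this (timestamps and other metadata elided):
$ kubectl get configmap fortune-config -o yaml
apiVersion: v1
data:
  sleep-interval: "25"
kind: ConfigMap
metadata:
  name: fortune-config
  namespace: default
  ...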
Nothing extraordinary. You could easily have written this YAML yourself (you wouldn’t
need to specify anything but the name in the metadata section, of course) and posted
it to the Kubernetes API with the well-known kubectl create -f command.
CREATING A CONFIGMAP ENTRY FROM THE CONTENTS OF A FILE
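ConfigMaps can also store the contents of whole files. The command referenced in the next paragraph would be along these lines (the ConfigMap name my-config is an assumption):
$ kubectl create configmap my-config --from-file=config-file.conf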
When you run the previous command, kubectl looks for the file config-file.conf in
the directory you run kubectl in. It will then store the contents of the file under the
key config-file.conf in the ConfigMap (the filename is used as the map key), but
you can also specify a key manually like this:
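For example:
$ kubectl create configmap my-config --from-file=customkey=config-file.conf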
This command will store the file’s contents under the key customkey. As with literals,
you can add multiple files by using the --from-file argument multiple times.
CREATING A CONFIGMAP FROM FILES IN A DIRECTORY
Instead of importing each file individually, you can even import all the files in a
directory:
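For example (the directory path is a placeholder):
$ kubectl create configmap my-config --from-file=/path/to/dir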
In this case, kubectl will create an individual map entry for each file in the specified
directory, but only for files whose name is a valid ConfigMap key.
COMBINING DIFFERENT OPTIONS
When creating ConfigMaps, you can use a combination of all the options mentioned
here (note that these files aren’t included in the book’s code archive—you can create
them yourself if you’d like to try out the command):
$ kubectl create configmap my-config
➥ --from-file=foo.json              # a single file
➥ --from-file=bar=foobar.conf       # a file stored under a custom key
➥ --from-file=config-opts/          # a whole directory
➥ --from-literal=some=thing         # a literal value
Here, you’ve created the ConfigMap from multiple sources: a whole directory, a file,
another file (but stored under a custom key instead of using the filename as the key),
and a literal value. Figure 7.5 shows all these sources and the resulting ConfigMap.
Figure 7.5 Creating a ConfigMap from individual files, a directory, and a literal value
Listing 7.9 Pod with env var from a config map: fortune-pod-env-configmap.yaml
apiVersion: v1
kind: Pod
metadata:
  name: fortune-env-from-configmap
spec:
  containers:
  - image: luksa/fortune:env
    env:
    - name: INTERVAL             # You're setting the environment variable called INTERVAL.
      valueFrom:                 # Instead of setting a fixed value, you're initializing it from a ConfigMap key.
        configMapKeyRef:
          name: fortune-config   # The name of the ConfigMap you're referencing
          key: sleep-interval    # You're setting the variable to whatever is stored under this key in the ConfigMap.
...
You defined an environment variable called INTERVAL and set its value to whatever is
stored in the fortune-config ConfigMap under the key sleep-interval. When the
process running in the html-generator container reads the INTERVAL environment
variable, it will see the value 25 (shown in figure 7.6).
Figure 7.6 Passing a ConfigMap entry as an environment variable to a container
Listing 7.10 Pod with env vars from all entries of a ConfigMap
spec:
  containers:
  - image: some-image
    envFrom:                 # Using envFrom instead of env
    - prefix: CONFIG_        # All environment variables will be prefixed with CONFIG_.
      configMapRef:
        name: my-config-map  # Referencing the ConfigMap called my-config-map
...
As you can see, you can also specify a prefix for the environment variables (CONFIG_ in
this case). This results in the following two environment variables being present inside
the container: CONFIG_FOO and CONFIG_BAR.
NOTE The prefix is optional, so if you omit it the environment variables will
have the same name as the keys.
Did you notice I said two variables, but earlier, I said the ConfigMap has three entries
(FOO, BAR, and FOO-BAR)? Why is there no environment variable for the FOO-BAR
ConfigMap entry?
The reason is that CONFIG_FOO-BAR isn’t a valid environment variable name
because it contains a dash. Kubernetes doesn’t convert the keys in any way (it doesn’t
convert dashes to underscores, for example). If a ConfigMap key isn’t in the proper
format, it skips the entry (but it does record an event informing you it skipped it).
Figure 7.7 The fortune-config ConfigMap’s sleep-interval entry is exposed as the INTERVAL environment variable and referenced as $(INTERVAL) in the container’s arguments.
apiVersion: v1
kind: Pod
metadata:
  name: fortune-args-from-configmap
spec:
  containers:
  - image: luksa/fortune:args    # Using the image that takes the interval from the first argument, not from an environment variable
    env:
    - name: INTERVAL             # Defining the environment variable exactly as before
      valueFrom:
        configMapKeyRef:
          name: fortune-config
          key: sleep-interval
    args: ["$(INTERVAL)"]        # Referencing the environment variable in the argument
...
You defined the environment variable exactly as you did before, but then you used the
$(ENV_VARIABLE_NAME) syntax to have Kubernetes inject the value of the variable into
the argument.
So far you’ve passed individual ConfigMap entries to containers as environment variables and command-line arguments. ConfigMap entries can also be exposed as files through a configMap volume. Although this method is mostly meant for passing large config files to the container, nothing prevents you from passing short single values this way.
CREATING THE CONFIGMAP
Instead of modifying your fortuneloop.sh script once again, you’ll now try a different
example. You’ll use a config file to configure the Nginx web server running inside the
fortune pod’s web-server container. Let’s say you want your Nginx server to compress
responses it sends to the client. To enable compression, the config file for Nginx
needs to look like the following listing.
server {
listen 80;
server_name www.kubia-example.com;
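# (The remainder of this listing is cut off above; roughly, it enables gzip
# compression and serves the generated HTML:)
gzip on;
gzip_types text/plain application/xml;
location / {
    root /usr/share/nginx/html;
    index index.html index.htm;
}
}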
Now delete your existing fortune-config ConfigMap with kubectl delete config-
map fortune-config, so that you can replace it with a new one, which will include the
Nginx config file. You’ll create the ConfigMap from files stored on your local disk.
Create a new directory called configmap-files and store the Nginx config from the
previous listing into configmap-files/my-nginx-config.conf. To make the ConfigMap
also contain the sleep-interval entry, add a plain text file called sleep-interval to the
same directory and store the number 25 in it (see figure 7.8).
Figure 7.8 The contents of the configmap-files directory and its files
Now create a ConfigMap from all the files in the directory like this:
$ kubectl create configmap fortune-config --from-file=configmap-files
configmap "fortune-config" created
The following listing shows what the YAML of this ConfigMap looks like.
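The YAML would look roughly like this (metadata and file contents abbreviated):
$ kubectl get configmap fortune-config -o yaml
apiVersion: v1
data:
  my-nginx-config.conf: |
    server {
      listen              80;
      server_name         www.kubia-example.com;
      ...
    }
  sleep-interval: |
    25
kind: ConfigMap
metadata:
  name: fortune-config
  ...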
NOTE The pipe character after the colon in the first line of both entries
signals that a literal multi-line value follows.
The ConfigMap contains two entries, with keys corresponding to the actual names
of the files they were created from. You’ll now use the ConfigMap in both of your
pod’s containers.
USING THE CONFIGMAP'S ENTRIES IN A VOLUME
Creating a volume populated with the contents of a ConfigMap is as easy as creating
a volume that references the ConfigMap by name and mounting the volume in a
container. You already learned how to create volumes and mount them, so the only
thing left to learn is how to initialize the volume with files created from a Config-
Map’s entries.
Nginx reads its config file from /etc/nginx/nginx.conf. The Nginx image
already contains this file with default configuration options, which you don’t want
to override, so you don’t want to replace this file as a whole. Luckily, the default
config file automatically includes all .conf files in the /etc/nginx/conf.d/ subdirec-
tory as well, so you should add your config file in there. Figure 7.9 shows what you
want to achieve.
The pod descriptor is shown in listing 7.14 (the irrelevant parts are omitted, but
you’ll find the complete file in the code archive).
Figure 7.9 The my-nginx-config.conf entry from the fortune-config ConfigMap is exposed through the configMap volume as a file in the web-server container’s /etc/nginx/conf.d/ directory.
apiVersion: v1
kind: Pod
metadata:
  name: fortune-configmap-volume
spec:
  containers:
  - image: nginx:alpine
    name: web-server
    volumeMounts:
    ...
    - name: config
      mountPath: /etc/nginx/conf.d   # You're mounting the configMap volume at this location.
      readOnly: true
    ...
  volumes:
  ...
  - name: config
    configMap:
      name: fortune-config           # The volume refers to your fortune-config ConfigMap.
...
Both entries from the ConfigMap have been added as files to the directory. The
sleep-interval entry is also included, although it has no business being there,
because it’s only meant to be used by the fortuneloop container. You could create
two different ConfigMaps and use one to configure the fortuneloop container and
the other one to configure the web-server container. But somehow it feels wrong to
use multiple ConfigMaps to configure containers of the same pod. After all, having
containers in the same pod implies that the containers are closely related and should
probably also be configured as a unit.
EXPOSING CERTAIN CONFIGMAP ENTRIES IN THE VOLUME
Luckily, you can populate a configMap volume with only part of the ConfigMap’s
entries—in your case, only the my-nginx-config.conf entry. This won’t affect the
fortuneloop container, because you’re passing the sleep-interval entry to it through
an environment variable and not through the volume.
To define which entries should be exposed as files in a configMap volume, use the
volume’s items attribute as shown in the following listing.
Listing 7.16 A pod with a specific ConfigMap entry mounted into a file directory: fortune-pod-configmap-volume-with-items.yaml
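The relevant part of the listing would look roughly like this:
  volumes:
  - name: config
    configMap:
      name: fortune-config
      items:                          # selecting which entries to include in the volume
      - key: my-nginx-config.conf     # the entry under this key
        path: gzip.conf               # is stored in this file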
When specifying individual entries, you need to set the filename for each individual
entry, along with the entry’s key. If you run the pod from the previous listing, the
/etc/nginx/conf.d directory is kept nice and clean, because it only contains the
gzip.conf file and nothing else.
UNDERSTANDING THAT MOUNTING A DIRECTORY HIDES EXISTING FILES IN THAT DIRECTORY
There’s one important thing to discuss at this point. In both this and in your previous
example, you mounted the volume as a directory, which means you’ve hidden any files
that are stored in the /etc/nginx/conf.d directory in the container image itself.
This is generally what happens in Linux when you mount a filesystem into a non-
empty directory. The directory then only contains the files from the mounted filesys-
tem, whereas the original files in that directory are inaccessible for as long as the
filesystem is mounted.
In your case, this has no terrible side effects, but imagine mounting a volume to
the /etc directory, which usually contains many important files. This would most likely
break the whole container, because all of the original files that should be in the /etc
directory would no longer be there. If you need to add a file to a directory like /etc,
you can’t use this method at all.
MOUNTING INDIVIDUAL CONFIGMAP ENTRIES AS FILES WITHOUT HIDING OTHER FILES IN THE DIRECTORY
Naturally, you’re now wondering how to add individual files from a ConfigMap into
an existing directory without hiding existing files stored in it. An additional subPath
property on the volumeMount allows you to mount either a single file or a single direc-
tory from the volume instead of mounting the whole volume. Perhaps this is easier to
explain visually (see figure 7.10).
Say you have a configMap volume containing a myconfig.conf file, which you want
to add to the /etc directory as someconfig.conf. You can use the subPath property to
mount it there without affecting any other files in that directory. The relevant part of
the pod definition is shown in the following listing.
Figure 7.10 Using subPath, the myconfig.conf entry from the app-config ConfigMap is mounted as /etc/someconfig.conf without hiding the existing files in /etc.
Listing 7.17 A pod with a specific config map entry mounted into a specific file
spec:
  containers:
  - image: some/image
    volumeMounts:
    - name: myvolume
      mountPath: /etc/someconfig.conf   # You're mounting into a file, not a directory.
      subPath: myconfig.conf            # Instead of mounting the whole volume, you're only mounting the myconfig.conf entry.
The subPath property can be used when mounting any kind of volume. Instead of
mounting the whole volume, you can mount part of it. But this method of mounting
individual files has a relatively big deficiency related to updating files. You’ll learn
more about this in the following section, but first, let’s finish talking about the initial
state of a configMap volume by saying a few words about file permissions.
SETTING THE FILE PERMISSIONS FOR FILES IN A CONFIGMAP VOLUME
By default, the permissions on all files in a configMap volume are set to 644 (-rw-r--r--).
You can change this by setting the defaultMode property in the volume spec, as shown
in the following listing.
volumes:
- name: config
  configMap:
    name: fortune-config
    defaultMode: 0660     # This sets the permissions for all files to -rw-rw----.
Although ConfigMaps should be used for non-sensitive configuration data, you may
want to make the file readable and writable only to the user and group the file is
owned by, as the example in the previous listing shows.
WARNING Be aware that as I’m writing this, it takes a surprisingly long time
for the files to be updated after you update the ConfigMap (it can take up to
one whole minute).
EDITING A CONFIGMAP
Let’s see how you can change a ConfigMap and have the process running in the pod
reload the files exposed in the configMap volume. You’ll modify the Nginx config file
from your previous example and make Nginx use the new config without restarting
the pod. Try switching gzip compression off by editing the fortune-config Config-
Map with kubectl edit:
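The command would simply be:
$ kubectl edit configmap fortune-config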
Once your editor opens, change the gzip on line to gzip off, save the file, and then
close the editor. The ConfigMap is then updated, and soon afterward, the actual file
in the volume is updated as well. You can confirm this by printing the contents of the
file with kubectl exec:
$ kubectl exec fortune-configmap-volume -c web-server
➥ cat /etc/nginx/conf.d/my-nginx-config.conf
If you don’t see the update yet, wait a while and try again. It takes a while for the
files to get updated. Eventually, you’ll see the change in the config file, but you’ll
find this has no effect on Nginx, because it doesn’t watch the files and reload them
automatically.
SIGNALING NGINX TO RELOAD THE CONFIG
Nginx will continue to compress its responses until you tell it to reload its config files,
which you can do with the following command:
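One way to do that is to run Nginx’s reload command inside the web-server container (older kubectl versions may not need the -- separator):
$ kubectl exec fortune-configmap-volume -c web-server -- nginx -s reload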
Now, if you try hitting the server again with curl, you should see the response is no
longer compressed (it no longer contains the Content-Encoding: gzip header).
You’ve effectively changed the app’s config without having to restart the container or
recreate the pod.
UNDERSTANDING HOW THE FILES ARE UPDATED ATOMICALLY
You may wonder what happens if an app can detect config file changes on its own and
reloads them before Kubernetes has finished updating all the files in the configMap
volume. Luckily, this can’t happen, because all the files are updated atomically, which
means all updates occur at once. Kubernetes achieves this by using symbolic links. If
you list all the files in the mounted configMap volume, you’ll see something like the
following listing.
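The output would look roughly like this (the timestamped directory name will differ):
$ kubectl exec fortune-configmap-volume -c web-server -- ls -lA /etc/nginx/conf.d
total 4
drwxr-xr-x ... ..4984_09_04_something
lrwxrwxrwx ... ..data -> ..4984_09_04_something
lrwxrwxrwx ... my-nginx-config.conf -> ..data/my-nginx-config.conf
lrwxrwxrwx ... sleep-interval -> ..data/sleep-interval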
As you can see, the files in the mounted configMap volume are symbolic links point-
ing to files in the ..data dir. The ..data dir is also a symbolic link pointing to a direc-
tory called ..4984_09_04_something. When the ConfigMap is updated, Kubernetes
creates a new directory like this, writes all the files to it, and then re-links the ..data
symbolic link to the new directory, effectively changing all files at once.
UNDERSTANDING THAT FILES MOUNTED INTO EXISTING DIRECTORIES DON’T GET UPDATED
One big caveat relates to updating ConfigMap-backed volumes. If you’ve mounted a
single file in the container instead of the whole volume, the file will not be updated!
At least, this is true at the time of writing this chapter.
For now, if you need to add an individual file and have it updated when you update
its source ConfigMap, one workaround is to mount the whole volume into a different
directory and then create a symbolic link pointing to the file in question. The symlink can either be created in the container image itself, or you could create the symlink when the container starts.
UNDERSTANDING THE CONSEQUENCES OF UPDATING A CONFIGMAP
One of the most important features of containers is their immutability, which allows
us to be certain that no differences exist between multiple running containers created
from the same image. Is it wrong, then, to bypass this immutability by modifying a
ConfigMap used by running containers?
The main problem occurs when the app doesn’t support reloading its configura-
tion. This results in different running instances being configured differently—those
pods that are created after the ConfigMap is changed will use the new config, whereas
the old pods will still use the old one. And this isn’t limited to new pods. If a pod’s con-
tainer is restarted (for whatever reason), the new process will also see the new config.
Therefore, if the app doesn’t reload its config automatically, modifying an existing
ConfigMap (while pods are using it) may not be a good idea.
If the app does support reloading, modifying the ConfigMap usually isn’t such a
big deal, but you do need to be aware that because files in the ConfigMap volumes
aren’t updated synchronously across all running instances, the files in individual pods
may be out of sync for up to a whole minute.
So far, all the information you’ve passed to your containers has been regular, non-sensitive configuration data. As mentioned at the beginning of the chapter, sensitive data such as credentials and private encryption keys should be stored in a Secret instead. Kubernetes helps keep your Secrets safe by making sure each Secret is only distributed
to the nodes that run the pods that need access to the Secret. Also, on the nodes
themselves, Secrets are always stored in memory and never written to physical storage,
which would require wiping the disks after deleting the Secrets from them.
On the master node itself (more specifically in etcd), Secrets used to be stored in
unencrypted form, which meant the master node needed to be secured to keep the sensitive
data stored in Secrets secure. This meant not only keeping the etcd storage
secure, but also preventing unauthorized users from using the API server, because any-
one who can create pods can mount the Secret into the pod and gain access to the sen-
sitive data through it. From Kubernetes version 1.7, etcd stores Secrets in encrypted
form, making the system much more secure. Because of this, it’s imperative you prop-
erly choose when to use a Secret or a ConfigMap. Choosing between them is simple:
Use a ConfigMap to store non-sensitive, plain configuration data.
Use a Secret to store any data that is sensitive in nature and needs to be kept
under lock and key. If a config file includes both sensitive and non-sensitive data, you
should store the file in a Secret.
You already used Secrets in chapter 5, when you created a Secret to hold the TLS certifi-
cate needed for the Ingress resource. Now you’ll explore Secrets in more detail.
Every pod has a secret volume attached to it automatically. If you run kubectl describe pod on any pod, the listed volumes include one that refers to a Secret called default-token-cfee9 (the suffix will differ in your cluster). Because
Secrets are resources, you can list them with kubectl get secrets and find the
default-token Secret in that list. Let’s see:
$ kubectl get secrets
NAME TYPE DATA AGE
default-token-cfee9 kubernetes.io/service-account-token 3 39d
You can also use kubectl describe to learn a bit more about it, as shown in the follow-
ing listing.
Data
====
ca.crt:     1139 bytes
namespace:  7 bytes
token:      eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...
You can see that the Secret contains three entries—ca.crt, namespace, and token—
which represent everything you need to securely talk to the Kubernetes API server
from within your pods, should you need to do that. Although ideally you want your
application to be completely Kubernetes-agnostic, when there’s no alternative other
than to talk to Kubernetes directly, you’ll use the files provided through this secret
volume.
The kubectl describe pod command shows where the secret volume is mounted:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-cfee9
To help you visualize where and how the default token Secret is mounted, see fig-
ure 7.11.
We’ve said Secrets are like ConfigMaps, so because this Secret contains three
entries, you can expect to see three files in the directory the secret volume is mounted
into. You can check this easily with kubectl exec:
$ kubectl exec mypod ls /var/run/secrets/kubernetes.io/serviceaccount/
ca.crt
namespace
token
You’ll see how your app can use these files to access the API server in the next chapter.
Figure 7.11 The default-token Secret is mounted into the container’s filesystem at /var/run/secrets/kubernetes.io/serviceaccount/ through the default-token secret volume.
Now, to help better demonstrate a few things about Secrets, create an additional
dummy file called foo and make it contain the string bar. You’ll understand why you
need to do this in a moment or two:
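For example:
$ echo bar > foo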
Now you can use kubectl create secret to create a Secret from the three files:
$ kubectl create secret generic fortune-https --from-file=https.key
➥ --from-file=https.cert --from-file=foo
secret "fortune-https" created
This isn’t very different from creating ConfigMaps. In this case, you’re creating a
generic Secret called fortune-https and including two entries in it (https.key with
the contents of the https.key file and likewise for the https.cert key/file). As you
learned earlier, you could also include the whole directory with --from-file=fortune-
https instead of specifying each file individually.
NOTE You’re creating a generic Secret, but you could also have created a tls
Secret with the kubectl create secret tls command, as you did in chapter 5.
This would create the Secret with different entry names, though.
Now compare this to the YAML of the ConfigMap you created earlier, which is shown
in the following listing.
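Neither YAML is reproduced here in full, but the data sections would look roughly like this (values abbreviated):
$ kubectl get secret fortune-https -o yaml
apiVersion: v1
data:
  foo: YmFyCg==
  https.cert: LS0tLS1CRUdJTi...
  https.key: LS0tLS1CRUdJTi...
kind: Secret
...
$ kubectl get configmap fortune-config -o yaml
apiVersion: v1
data:
  my-nginx-config.conf: |
    server {
      ...
    }
  sleep-interval: |
    25
kind: ConfigMap
...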
Notice the difference? The contents of a Secret’s entries are shown as Base64-encoded
strings, whereas those of a ConfigMap are shown in clear text. This initially made
working with Secrets in YAML and JSON manifests a bit more painful, because you
had to encode and decode them when setting and reading their entries.
USING SECRETS FOR BINARY DATA
The reason for using Base64 encoding is simple. A Secret’s entries can contain binary
values, not only plain-text. Base64 encoding allows you to include the binary data in
YAML or JSON, which are both plain-text formats.
TIP You can use Secrets even for non-sensitive binary data, but be aware that
the maximum size of a Secret is limited to 1MB.
Listing 7.23 Adding plain text entries to a Secret using the stringData field
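A sketch of such a Secret manifest (entry values are illustrative):
kind: Secret
apiVersion: v1
stringData:          # stringData can be used for non-binary Secret data.
  foo: plain text    # See, "plain text" is not Base64-encoded.
data:
  https.cert: LS0tLS1CRUdJTi...
  https.key: LS0tLS1CRUdJTi...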
The stringData field is write-only (note: write-only, not read-only). It can only be
used to set values. When you retrieve the Secret’s YAML with kubectl get -o yaml, the
stringData field will not be shown. Instead, all entries you specified in the string-
Data field (such as the foo entry in the previous example) will be shown under data
and will be Base64-encoded like all the other entries.
READING A SECRET’S ENTRY IN A POD
When you expose the Secret to a container through a secret volume, the value of the
Secret entry is decoded and written to the file in its actual form (regardless if it’s plain
text or binary). The same is also true when exposing the Secret entry through an envi-
ronment variable. In both cases, the app doesn’t need to decode it, but can read the
file’s contents or look up the environment variable value and use it directly.
After the text editor opens, modify the part that defines the contents of the my-nginx-
config.conf entry so it looks like the following listing.
...
data:
  my-nginx-config.conf: |
    server {
        listen              80;
        listen              443 ssl;
        server_name         www.kubia-example.com;
        ssl_certificate     certs/https.cert;
        ssl_certificate_key certs/https.key;
        location / {
            root   /usr/share/nginx/html;
            index  index.html index.htm;
        }
    }
  sleep-interval: |
...
This configures the server to read the certificate and key files from /etc/nginx/certs,
so you’ll need to mount the secret volume there.
MOUNTING THE FORTUNE-HTTPS SECRET IN A POD
Next, you’ll create a new fortune-https pod and mount the secret volume holding
the certificate and key into the proper location in the web-server container, as shown
in the following listing.
apiVersion: v1
kind: Pod
metadata:
  name: fortune-https
spec:
  containers:
  - image: luksa/fortune:env
    name: html-generator
    env:
    - name: INTERVAL
      valueFrom:
        configMapKeyRef:
          name: fortune-config
          key: sleep-interval
    volumeMounts:
    - name: html
      mountPath: /var/htdocs
  - image: nginx:alpine
    name: web-server
    volumeMounts:
    - name: html
      mountPath: /usr/share/nginx/html
      readOnly: true
    - name: config
      mountPath: /etc/nginx/conf.d
      readOnly: true
    - name: certs                    # You configured Nginx to read the cert and key file from
      mountPath: /etc/nginx/certs/   # /etc/nginx/certs, so you need to mount the Secret volume there.
      readOnly: true
    ports:
    - containerPort: 80
    - containerPort: 443
  volumes:
  - name: html
    emptyDir: {}
  - name: config
    configMap:
      name: fortune-config
      items:
      - key: my-nginx-config.conf
        path: https.conf
  - name: certs                      # You define the secret volume here,
    secret:                          # referring to the fortune-https Secret.
      secretName: fortune-https
Much is going on in this pod descriptor, so let me help you visualize it. Figure 7.12
shows the components defined in the YAML. The default-token Secret, volume, and
volume mount, which aren’t part of the YAML, but are added to your pod automati-
cally, aren’t shown in the figure.
Figure 7.12 Combining a ConfigMap and a Secret to run your fortune-https pod (the default token Secret and its volume are not shown)
NOTE Like configMap volumes, secret volumes also support specifying file
permissions for the files exposed in the volume through the defaultMode
property.
TESTING WHETHER NGINX IS USING THE CERT AND KEY FROM THE SECRET
Once the pod is running, you can see if it’s serving HTTPS traffic by opening a port-
forward tunnel to the pod’s port 443 and using it to send a request to the server
with curl:
$ kubectl port-forward fortune-https 8443:443 &
Forwarding from 127.0.0.1:8443 -> 443
Forwarding from [::1]:8443 -> 443
$ curl https://localhost:8443 -k
If you configured the server properly, you should get a response. You can check the
server’s certificate to see if it matches the one you generated earlier. This can also be
done with curl by turning on verbose logging using the -v option, as shown in the fol-
lowing listing.
$ curl https://localhost:8443 -k -v
* About to connect() to localhost port 8443 (#0)
*   Trying ::1...
* Connected to localhost (::1) port 8443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
*   subject: CN=www.kubia-example.com          The certificate matches
*   start date: aug 16 18:43:13 2016 GMT       the one you created and
*   expire date: aug 14 18:43:13 2026 GMT      stored in the Secret.
*   common name: www.kubia-example.com
*   issuer: CN=www.kubia-example.com
Secret volumes are backed by an in-memory filesystem (tmpfs). Because tmpfs is used, the sensitive data stored in the Secret is never written to disk, where it could be compromised.
EXPOSING A SECRET’S ENTRIES THROUGH ENVIRONMENT VARIABLES
Instead of using a volume, you could also have exposed individual entries from the
secret as environment variables, the way you did with the sleep-interval entry from
the ConfigMap. For example, if you wanted to expose the foo key from your Secret as
environment variable FOO_SECRET, you’d add the snippet from the following listing to
the container definition.
env:
- name: FOO_SECRET
  valueFrom:                # The variable should be set from the entry of a Secret.
    secretKeyRef:
      name: fortune-https   # The name of the Secret holding the key
      key: foo              # The key of the Secret to expose
This is almost exactly like when you set the INTERVAL environment variable, except
that this time you’re referring to a Secret by using secretKeyRef instead of config-
MapKeyRef, which is used to refer to a ConfigMap.
Even though Kubernetes enables you to expose Secrets through environment vari-
ables, it may not be the best idea to use this feature. Applications usually dump envi-
ronment variables in error reports or even write them to the application log at startup,
which may unintentionally expose them. Additionally, child processes inherit all the
environment variables of the parent process, so if your app runs a third-party binary,
you have no way of knowing what happens with your secret data.
TIP Think twice before using environment variables to pass your Secrets to
your container, because they may get exposed inadvertently. To be safe, always
use secret volumes for exposing Secrets.
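Creating a Secret for authenticating with Docker Hub would look roughly like this (substitute your own credentials):
$ kubectl create secret docker-registry mydockerhubsecret
➥ --docker-username=myusername --docker-password=mypassword
➥ --docker-email=my.email@provider.com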
Rather than create a generic secret, you’re creating a docker-registry Secret called
mydockerhubsecret. You’re specifying your Docker Hub username, password, and
email. If you inspect the contents of the newly created Secret with kubectl describe,
you’ll see that it includes a single entry called .dockercfg. This is equivalent to the
.dockercfg file in your home directory, which is created by Docker when you run the
docker login command.
apiVersion: v1
kind: Pod
metadata:
  name: private-pod
spec:
  imagePullSecrets:              # This enables pulling images from a private image registry.
  - name: mydockerhubsecret
  containers:
  - image: username/private:tag
    name: main
In the pod definition in the previous listing, you’re specifying the mydockerhubsecret
Secret as one of the imagePullSecrets. I suggest you try this out yourself, because it’s
likely you’ll deal with private container images soon.
NOT HAVING TO SPECIFY IMAGE PULL SECRETS ON EVERY POD
Given that people usually run many different pods in their systems, it makes you won-
der if you need to add the same image pull Secrets to every pod. Luckily, that’s not the
case. In chapter 12 you’ll learn how image pull Secrets can be added to all your pods
automatically if you add the Secrets to a ServiceAccount.
7.6 Summary
This wraps up this chapter on how to pass configuration data to containers. You’ve
learned how to
Override the default command defined in a container image in the pod definition
Pass command-line arguments to the main container process
Set environment variables for a container
Decouple configuration from a pod specification and put it into a ConfigMap
Store sensitive data in a Secret and deliver it securely to containers
Create a docker-registry Secret and use it to pull images from a private image
registry
In the next chapter, you’ll learn how to pass pod and container metadata to applica-
tions running inside them. You’ll also see how the default token Secret, which we
learned about in this chapter, is used to talk to the API server from within a pod.
Accessing pod metadata
and other resources
from applications
Applications often need information about the environment they’re running in,
including details about themselves and that of other components in the cluster.
You’ve already seen how Kubernetes enables service discovery through environ-
ment variables or DNS, but what about other information? In this chapter, you’ll
see how certain pod and container metadata can be passed to the container and
how easy it is for an app running inside a container to talk to the Kubernetes API
server to get information about the resources deployed in the cluster and even how
to create or modify those resources.
Figure 8.1 The Downward API exposes pod metadata through environment variables or files.
Through the Downward API, you can expose the pod’s name, IP address, and namespace, the name of the node the pod is running on, the name of the pod’s service account, each container’s CPU and memory requests and limits, and the pod’s labels and annotations. Most of the items in this list shouldn’t require further explanation, except perhaps the
service account and CPU/memory requests and limits, which we haven’t introduced
yet. We’ll cover service accounts in detail in chapter 12. For now, all you need to know
is that a service account is the account that the pod authenticates as when talking to
the API server. CPU and memory requests and limits are explained in chapter 14.
They’re the amount of CPU and memory guaranteed to a container and the maxi-
mum amount it can get.
Most items in the list can be passed to containers either through environment vari-
ables or through a downwardAPI volume, but labels and annotations can only be
exposed through the volume. Part of the data can be acquired by other means (for
example, from the operating system directly), but the Downward API provides a sim-
pler alternative.
Let’s look at an example to pass metadata to your containerized process.
apiVersion: v1
kind: Pod
metadata:
name: downward
spec:
containers:
- name: main
image: busybox
command: ["sleep", "9999999"]
resources:
requests:
cpu: 15m
memory: 100Ki
limits:
cpu: 100m
memory: 4Mi
env:
- name: POD_NAME
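# (The rest of this listing is cut off here; based on the environment variables
# described below, the remaining entries would look roughly like this.)
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
- name: POD_NAMESPACE
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace
- name: POD_IP
  valueFrom:
    fieldRef:
      fieldPath: status.podIP
- name: NODE_NAME
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName
- name: SERVICE_ACCOUNT
  valueFrom:
    fieldRef:
      fieldPath: spec.serviceAccountName
- name: CONTAINER_CPU_REQUEST_MILLICORES
  valueFrom:
    resourceFieldRef:
      resource: requests.cpu
      divisor: 1m
- name: CONTAINER_MEMORY_LIMIT_KIBIBYTES
  valueFrom:
    resourceFieldRef:
      resource: limits.memory
      divisor: 1Ki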
When your process runs, it can look up all the environment variables you defined in
the pod spec. Figure 8.2 shows the environment variables and the sources of their val-
ues. The pod’s name, IP, and namespace will be exposed through the POD_NAME,
POD_IP, and POD_NAMESPACE environment variables, respectively. The name of the
node the container is running on will be exposed through the NODE_NAME variable.
The name of the service account is made available through the SERVICE_ACCOUNT
environment variable. You’re also creating two environment variables that will hold
the amount of CPU requested for this container and the maximum amount of mem-
ory the container is allowed to consume.
For environment variables exposing resource limits or requests, you specify a divi-
sor. The actual value of the limit or the request will be divided by the divisor and the
result exposed through the environment variable. In the previous example, you’re set-
ting the divisor for CPU requests to 1m (one milli-core, or one one-thousandth of a
CPU core). Because you’ve set the CPU request to 15m, the environment variable
CONTAINER_CPU_REQUEST_MILLICORES will be set to 15. Likewise, you set the memory
limit to 4Mi (4 mebibytes) and the divisor to 1Ki (1 Kibibyte), so the CONTAINER_MEMORY
_LIMIT_KIBIBYTES environment variable will be set to 4096.
Figure 8.2 Pod metadata and attributes can be exposed to the pod through environment variables.
The divisor for CPU limits and requests can be either 1, which means one whole core,
or 1m, which is one millicore. The divisor for memory limits/requests can be 1 (byte),
1k (kilobyte) or 1Ki (kibibyte), 1M (megabyte) or 1Mi (mebibyte), and so on.
After creating the pod, you can use kubectl exec to see all these environment vari-
ables in your container, as shown in the following listing.
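The output would include lines like these (the actual values come from your cluster; the ones shown here assume the Minikube environment used elsewhere in the book):
$ kubectl exec downward env
...
POD_NAME=downward
POD_NAMESPACE=default
POD_IP=172.17.0.4
NODE_NAME=minikube
SERVICE_ACCOUNT=default
CONTAINER_CPU_REQUEST_MILLICORES=15
CONTAINER_MEMORY_LIMIT_KIBIBYTES=4096
...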
All processes running inside the container can read those variables and use them how-
ever they need.
apiVersion: v1
kind: Pod
metadata:
  name: downward
  labels:
    foo: bar
  annotations:                 # These labels and annotations will be
    key1: value1               # exposed through the downwardAPI volume.
    key2: |
      multi
      line
      value
spec:
  containers:
  - name: main
    image: busybox
    command: ["sleep", "9999999"]
    resources:
      requests:
        cpu: 15m
        memory: 100Ki
      limits:
        cpu: 100m
        memory: 4Mi
    volumeMounts:
    - name: downward           # You're mounting the downward volume
      mountPath: /etc/downward # under /etc/downward.
  volumes:
  - name: downward             # You're defining a downwardAPI volume
    downwardAPI:               # with the name downward.
      items:
      - path: "podName"        # The pod's name (from the metadata.name field
        fieldRef:              # in the manifest) will be written to the podName file.
          fieldPath: metadata.name
      - path: "podNamespace"
        fieldRef:
          fieldPath: metadata.namespace
      - path: "labels"         # The pod's labels will be written to the /etc/downward/labels file.
        fieldRef:
          fieldPath: metadata.labels
      - path: "annotations"    # The pod's annotations will be written to the /etc/downward/annotations file.
        fieldRef:
          fieldPath: metadata.annotations
      - path: "containerCpuRequestMilliCores"
        resourceFieldRef:
          containerName: main
          resource: requests.cpu
          divisor: 1m
      - path: "containerMemoryLimitBytes"
        resourceFieldRef:
          containerName: main
          resource: limits.memory
          divisor: 1
Instead of passing the metadata through environment variables, you’re defining a vol-
ume called downward and mounting it in your container under /etc/downward. The
files this volume will contain are configured under the downwardAPI.items attribute
in the volume specification.
Each item specifies the path (the filename) where the metadata should be written
to and references either a pod-level field or a container resource field whose value you
want stored in the file (see figure 8.3).
Figure 8.3 The files in the downwardAPI volume (podName, podNamespace, labels, annotations, containerCpuRequestMilliCores, containerMemoryLimitBytes) and the pod manifest fields they’re populated from
Delete the previous pod and create a new one from the manifest in the previous list-
ing. Then look at the contents of the mounted downwardAPI volume directory. You
mounted the volume under /etc/downward/, so list the files in there, as shown in the
following listing.
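The directory would contain one file per item defined in the volume, roughly like this:
$ kubectl exec downward -- ls -1 /etc/downward
annotations
containerCpuRequestMilliCores
containerMemoryLimitBytes
labels
podName
podNamespace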
NOTE As with the configMap and secret volumes, you can change the file
permissions through the downwardAPI volume’s defaultMode property in the
pod spec.
Each file corresponds to an item in the volume’s definition. The contents of files,
which correspond to the same metadata fields as in the previous example, are the
same as the values of environment variables you used before, so we won’t show them
here. But because you couldn’t expose labels and annotations through environment
variables before, examine the following listing for the contents of the two files you
exposed them in.
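Based on the labels and annotations defined in the pod manifest, the two files would contain something like this:
$ kubectl exec downward -- cat /etc/downward/labels
foo="bar"
$ kubectl exec downward -- cat /etc/downward/annotations
key1="value1"
key2="multi\nline\nvalue\n"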
As you can see, each label/annotation is written in the key=value format on a sepa-
rate line. Multi-line values are written to a single line with newline characters denoted
with \n.
UPDATING LABELS AND ANNOTATIONS
You may remember that labels and annotations can be modified while a pod is run-
ning. As you might expect, when they change, Kubernetes updates the files holding
them, allowing the pod to always see up-to-date data. This also explains why labels and
annotations can’t be exposed through environment variables. Because environment
variable values can’t be updated afterward, if the labels or annotations of a pod were
exposed through environment variables, there’s no way to expose the new values after
they’re modified.
spec:
  volumes:
  - name: downward
    downwardAPI:
      items:
      - path: "containerCpuRequestMilliCores"
        resourceFieldRef:
          containerName: main     # Container name must be specified
          resource: requests.cpu
          divisor: 1m
The reason for this becomes obvious if you consider that volumes are defined at the
pod level, not at the container level. When referring to a container’s resource field
inside a volume specification, you need to explicitly specify the name of the container
you’re referring to. This is true even for single-container pods.
Using volumes to expose a container’s resource requests and/or limits is slightly
more complicated than using environment variables, but the benefit is that it allows
you to pass one container’s resource fields to a different container if needed (but
both containers need to be in the same pod). With environment variables, a container
can only be passed its own resource limits and requests.
UNDERSTANDING WHEN TO USE THE DOWNWARD API
As you’ve seen, using the Downward API isn’t complicated. It allows you to keep the
application Kubernetes-agnostic. This is especially useful when you’re dealing with an
existing application that expects certain data in environment variables. The Down-
ward API allows you to expose the data to the application without having to rewrite
the application or wrap it in a shell script, which collects the data and then exposes it
through environment variables.
But the metadata available through the Downward API is fairly limited. If you need
more, you’ll need to obtain it from the Kubernetes API server directly. You’ll learn
how to do that next.
As you’ve seen throughout the book, information about services and pods can be
obtained by looking at the service-related environment variables or through DNS. But
when the app needs data about other resources or when it requires access to the most
up-to-date information as possible, it needs to talk to the API server directly (as shown
in figure 8.4).
Figure 8.4 The app process inside a container talks to the API server to get information about API objects.
Before you see how apps within pods can talk to the Kubernetes API server, let’s first
explore the server’s REST endpoints from your local machine, so you can see what
talking to the API server looks like.
Because the server uses HTTPS and requires authentication, it’s not simple to talk to
it directly. You can try accessing it with curl and using curl’s --insecure (or -k)
option to skip the server certificate check, but that doesn’t get you far:
$ curl https://192.168.99.100:8443 -k
Unauthorized
Luckily, rather than dealing with authentication yourself, you can talk to the server
through a proxy by running the kubectl proxy command.
ACCESSING THE API SERVER THROUGH KUBECTL PROXY
The kubectl proxy command runs a proxy server that accepts HTTP connections on
your local machine and proxies them to the API server while taking care of authenti-
cation, so you don’t need to pass the authentication token in every request. It also
makes sure you’re talking to the actual API server and not a man in the middle (by
verifying the server’s certificate on each request).
Running the proxy is trivial. All you need to do is run the following command:
$ kubectl proxy
Starting to serve on 127.0.0.1:8001
You don’t need to pass in any other arguments, because kubectl already knows every-
thing it needs (the API server URL, authorization token, and so on). As soon as it starts
up, the proxy starts accepting connections on local port 8001. Let’s see if it works:
$ curl localhost:8001
{
"paths": [
"/api",
"/api/v1",
...
Voila! You sent the request to the proxy, it sent a request to the API server, and then
the proxy returned whatever the server returned. Now, let’s start exploring.
EXPLORING THE KUBERNETES API THROUGH THE KUBECTL PROXY
You can continue to use curl, or you can open your web browser and point it to
http://localhost:8001. Let’s examine what the API server returns when you hit its base
URL more closely. The server responds with a list of paths, as shown in the follow-
ing listing.
$ curl http://localhost:8001
{
  "paths": [
    "/api",                         Most resource types
    "/api/v1",                      can be found here.
    "/apis",
    "/apis/apps",
    "/apis/apps/v1beta1",
    ...
    "/apis/batch",                  The batch API
    "/apis/batch/v1",               group and its
    "/apis/batch/v2alpha1",         two versions
    ...
These paths correspond to the API groups and versions you specify in your resource
definitions when creating resources such as Pods, Services, and so on.
You may recognize the batch/v1 in the /apis/batch/v1 path as the API group and
version of the Job resources you learned about in chapter 4. Likewise, the /api/v1
corresponds to the apiVersion: v1 you refer to in the common resources you created
(Pods, Services, ReplicationControllers, and so on). The most common resource
types, which were introduced in the earliest versions of Kubernetes, don’t belong to
any specific group, because Kubernetes initially didn’t even use the concept of API
groups; they were introduced later.
NOTE These initial resource types without an API group are now considered
to belong to the core API group.
$ curl http://localhost:8001/apis/batch
{
  "kind": "APIGroup",
  "apiVersion": "v1",
  "name": "batch",
  "versions": [
    {
      "groupVersion": "batch/v1",
      "version": "v1"
    },                                       The batch API group
    {                                        contains two versions.
      "groupVersion": "batch/v2alpha1",
      "version": "v2alpha1"
    }
  ],
  "preferredVersion": {                      Clients should use the
    "groupVersion": "batch/v1",              v1 version instead of
    "version": "v1"                          v2alpha1.
  },
  "serverAddressByClientCIDRs": null
}
The response shows a description of the batch API group, including the available ver-
sions and the preferred version clients should use. Let’s continue and see what’s
behind the /apis/batch/v1 path. It’s shown in the following listing.
$ curl http://localhost:8001/apis/batch/v1
{
  "kind": "APIResourceList",             This is a list of API resources
  "apiVersion": "v1",                    in the batch/v1 API group.
  "groupVersion": "batch/v1",
  "resources": [                         Here's an array holding all the
    {                                    resource types in this group.
      "name": "jobs",                    This describes the Job resource,
      "namespaced": true,                which is namespaced.
      "kind": "Job",
      "verbs": [                         Here are the verbs that can be used
        "create",                        with this resource (you can create
        "delete",                        Jobs; delete individual ones or a
        "deletecollection",              collection of them; and retrieve,
        "get",                           watch, and update them).
        "list",
        "patch",
        "update",
        "watch"
      ]
    },
    {
      "name": "jobs/status",             Resources also have a special REST
      "namespaced": true,                endpoint for modifying their status.
      "kind": "Job",
      "verbs": [
        "get",                           The status can be retrieved,
        "patch",                         patched, or updated.
        "update"
      ]
    }
  ]
}
As you can see, the API server returns a list of resource types and REST endpoints in
the batch/v1 API group. One of those is the Job resource. In addition to the name of
the resource and the associated kind, the API server also includes information on
whether the resource is namespaced or not, its short name (if it has one; Jobs don’t),
and a list of verbs you can use with the resource.
The returned list describes the REST resources exposed in the API server. The
"name": "jobs" line tells you that the API contains the /apis/batch/v1/jobs end-
point. The "verbs" array says you can retrieve, update, and delete Job resources
through that endpoint. For certain resources, additional API endpoints are also
exposed (such as the jobs/status path, which allows modifying only the status of
a Job).
LISTING ALL JOB INSTANCES IN THE CLUSTER
To get a list of Jobs in your cluster, perform a GET request on path /apis/batch/
v1/jobs, as shown in the following listing.
$ curl http://localhost:8001/apis/batch/v1/jobs
{
"kind": "JobList",
"apiVersion": "batch/v1",
"metadata": {
"selfLink": "/apis/batch/v1/jobs",
"resourceVersion": "225162"
},
"items": [
{
"metadata": {
"name": "my-job",
"namespace": "default",
...
You probably have no Job resources deployed in your cluster, so the items array will be
empty. You can try deploying the Job in Chapter08/my-job.yaml and hitting the REST
endpoint again to get the same output as in listing 8.10.
RETRIEVING A SPECIFIC JOB INSTANCE BY NAME
The previous endpoint returned a list of all Jobs across all namespaces. To get back
only one specific Job, you need to specify its name and namespace in the URL. To
retrieve the Job shown in the previous listing (name: my-job; namespace: default),
you need to request the following path: /apis/batch/v1/namespaces/default/jobs/
my-job, as shown in the following listing.
$ curl http://localhost:8001/apis/batch/v1/namespaces/default/jobs/my-job
{
"kind": "Job",
"apiVersion": "batch/v1",
"metadata": {
"name": "my-job",
"namespace": "default",
...
As you can see, you get back the complete JSON definition of the my-job Job resource,
exactly like you do if you run:
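The equivalent kubectl command would be:
$ kubectl get job my-job -o json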
You’ve seen that you can browse the Kubernetes REST API server without using any
special tools, but to fully explore the REST API and interact with it, a better option is
described at the end of this chapter. For now, exploring it with curl like this is enough
to make you understand how an application running in a pod talks to Kubernetes.
Listing 8.12 A pod for trying out communication with the API server: curl.yaml
apiVersion: v1
kind: Pod
metadata:
  name: curl
spec:
  containers:
  - name: main
    image: tutum/curl                # Using the tutum/curl image, because you need curl available in the container
    command: ["sleep", "9999999"]    # You're running the sleep command with a long delay to keep your container running.
After creating the pod, run kubectl exec to run a bash shell inside its container:
$ kubectl exec -it curl bash
root@curl:/#
And you’ll remember from chapter 5 that environment variables are configured for
each service. You can get both the IP address and the port of the API server by looking
up the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT variables (inside
the container):
root@curl:/# env | grep KUBERNETES_SERVICE
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_HOST=10.0.0.1
KUBERNETES_SERVICE_PORT_HTTPS=443
You may also remember that each service also gets a DNS entry, so you don’t even
need to look up the environment variables, but instead simply point curl to
https://kubernetes. To be fair, if you don’t know which port the service is available at,
you also either need to look up the environment variables or perform a DNS SRV
record lookup to get the service’s actual port number.
The environment variables shown previously say that the API server is listening on
port 443, which is the default port for HTTPS, so try hitting the server through
HTTPS:
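Without the CA certificate, curl would refuse to talk to the server with an error along these lines:
root@curl:/# curl https://kubernetes
curl: (60) SSL certificate problem: unable to get local issuer certificate
...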
Although the simplest way to get around this is to use the proposed -k option (and
this is what you’d normally use when playing with the API server manually), let’s look
at the longer (and correct) route. Instead of blindly trusting that the server you’re
connecting to is the authentic API server, you’ll verify its identity by having curl check
its certificate. The CA certificate you need for that is available inside the container, in the automatically mounted default-token Secret volume:
root@curl:/# ls /var/run/secrets/kubernetes.io/serviceaccount/
ca.crt namespace token
The Secret has three entries (and therefore three files in the Secret volume). Right
now, we’ll focus on the ca.crt file, which holds the certificate of the certificate author-
ity (CA) used to sign the Kubernetes API server’s certificate. To verify you’re talking to
the API server, you need to check if the server’s certificate is signed by the CA. curl
allows you to specify the CA certificate with the --cacert option, so try hitting the API
server again:
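root@curl:/# curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt https://kubernetes
Unauthorized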
Okay, you’ve made progress. curl verified the server’s identity because its certificate
was signed by the CA you trust. As the Unauthorized response suggests, you still need
to take care of authentication. You’ll do that in a moment, but first let’s see how to
make life easier by setting the CURL_CA_BUNDLE environment variable, so you don’t
need to specify --cacert every time you run curl:
root@curl:/# export CURL_CA_BUNDLE=/var/run/secrets/kubernetes.io/
➥ serviceaccount/ca.crt
You can now hit the API server without using --cacert:
root@curl:/# curl https://kubernetes
Unauthorized
This is much nicer now. Your client (curl) trusts the API server now, but the API
server itself says you’re not authorized to access it, because it doesn’t know who
you are.
AUTHENTICATING WITH THE API SERVER
You need to authenticate with the server, so it allows you to read and even update
and/or delete the API objects deployed in the cluster. To authenticate, you need an
authentication token. Luckily, the token is provided through the default-token Secret
mentioned previously, and is stored in the token file in the secret volume. As the
Secret’s name suggests, that’s the primary purpose of the Secret.
You’re going to use the token to access the API server. First, load the token into an
environment variable:
root@curl:/# TOKEN=$(cat /var/run/secrets/kubernetes.io/
➥ serviceaccount/token)
The token is now stored in the TOKEN environment variable. You can use it when send-
ing requests to the API server, as shown in the following listing.
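A sketch of such a request, with the token passed in the Authorization header (the response is abridged):
root@curl:/# curl -H "Authorization: Bearer $TOKEN" https://kubernetes
{
  "paths": [
    "/api",
    "/api/v1",
    ...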
NOTE If your cluster has role-based access control (RBAC) enabled, the default service account may not be authorized to access (parts of) the API server. For test purposes, you can work around this by granting all service accounts cluster-admin privileges, for example by running kubectl create clusterrolebinding permissive-binding --clusterrole=cluster-admin --group=system:serviceaccounts. This gives all service accounts (we could also say all pods) cluster-admin privileges, allowing them to do whatever they want. Obviously, doing this is dangerous and should never be done on production clusters. For test purposes, it's fine.
As you can see, you passed the token inside the Authorization HTTP header in the
request. The API server recognized the token as authentic and returned a proper
response. You can now explore all the resources in your cluster, the way you did a few
sections ago.
For example, you could list all the pods in the same namespace. But first you need
to know what namespace the curl pod is running in.
GETTING THE NAMESPACE THE POD IS RUNNING IN
In the first part of this chapter, you saw how to pass the namespace to the pod
through the Downward API. But if you’re paying attention, you probably noticed
your secret volume also contains a file called namespace. It contains the name-
space the pod is running in, so you can read the file instead of having to explicitly
pass the namespace to your pod through an environment variable. Load the con-
tents of the file into the NS environment variable and then list all the pods, as shown
in the following listing.
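A sketch of those two steps, using the namespace file alongside the token (output abridged):
root@curl:/# NS=$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)
root@curl:/# curl -H "Authorization: Bearer $TOKEN" https://kubernetes/api/v1/namespaces/$NS/pods
{
  "kind": "PodList",
  "apiVersion": "v1",
  ...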
And there you go. By using the three files in the mounted secret volume directory,
you listed all the pods running in the same namespace as your pod. In the same man-
ner, you could also retrieve other API objects and even update them by sending PUT or
PATCH instead of simple GET requests.
DEFINITION CRUD stands for Create, Read, Update, and Delete. The corre-
sponding HTTP methods are POST, GET, PATCH/PUT, and DELETE, respectively.
All three aspects of pod to API server communication are displayed in figure 8.5.
Figure 8.5 Using the files from the default-token Secret to talk to the API server
Remember the kubectl proxy command we mentioned in section 8.2.1? You ran
the command on your local machine to make it easier to access the API server. Instead
of sending requests to the API server directly, you sent them to the proxy and let it
take care of authentication, encryption, and server verification. The same method can
be used inside your pods, as well.
INTRODUCING THE AMBASSADOR CONTAINER PATTERN
Imagine having an application that (among other things) needs to query the API
server. Instead of it talking to the API server directly, as you did in the previous sec-
tion, you can run kubectl proxy in an ambassador container alongside the main con-
tainer and communicate with the API server through it.
Instead of talking to the API server directly, the app in the main container can con-
nect to the ambassador through HTTP (instead of HTTPS) and let the ambassador
proxy handle the HTTPS connection to the API server, taking care of security trans-
parently (see figure 8.6). It does this by using the files from the default token’s secret
volume.
Figure 8.6 The main container talks to the ambassador over plain HTTP; the ambassador handles the HTTPS connection to the API server
Because all containers in a pod share the same loopback network interface, your app
can access the proxy through a port on localhost.
RUNNING THE CURL POD WITH AN ADDITIONAL AMBASSADOR CONTAINER
To see the ambassador container pattern in action, you’ll create a new pod like the
curl pod you created earlier, but this time, instead of running a single container in
the pod, you’ll run an additional ambassador container based on a general-purpose
kubectl-proxy container image I’ve created and pushed to Docker Hub. You’ll find
the Dockerfile for the image in the code archive (in /Chapter08/kubectl-proxy/) if
you want to build it yourself.
The pod’s manifest is shown in the following listing.
apiVersion: v1
kind: Pod
metadata:
  name: curl-with-ambassador
spec:
  containers:
  - name: main
    image: tutum/curl
    command: ["sleep", "9999999"]
  - name: ambassador                   # the ambassador container, running the kubectl-proxy image
    image: luksa/kubectl-proxy:1.6.2
The pod spec is almost the same as before, but with a different pod name and an additional container. Run the pod and then enter the main container with
$ kubectl exec -it curl-with-ambassador -c main bash
Your pod now has two containers, and you want to run bash in the main container,
hence the -c main option. You don’t need to specify the container explicitly if you
want to run the command in the pod’s first container. But if you want to run a com-
mand inside any other container, you do need to specify the container’s name using
the -c option.
TALKING TO THE API SERVER THROUGH THE AMBASSADOR
Next you’ll try connecting to the API server through the ambassador container. By
default, kubectl proxy binds to port 8001, and because both containers in the pod
share the same network interfaces, including loopback, you can point curl to local-
host:8001, as shown in the following listing.
Listing 8.16 Accessing the API server through the ambassador container
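root@curl-with-ambassador:/# curl localhost:8001
{
  "paths": [
    "/api",
    "/api/v1",
    ...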
Success! The output printed by curl is the same response you saw earlier, but this time
you didn’t need to deal with authentication tokens and server certificates.
To get a clear picture of what exactly happened, refer to figure 8.7. curl sent the
plain HTTP request (without any authentication headers) to the proxy running inside
the ambassador container, and then the proxy sent an HTTPS request to the API
server, handling the client authentication by sending the token and checking the
server’s identity by validating its certificate.
This is a great example of how an ambassador container can be used to hide the
complexities of connecting to an external service and simplify the app running in
the main container. The ambassador container is reusable across many different apps,
regardless of what language the main app is written in. The downside is that an addi-
tional process is running and consuming additional resources.
Figure 8.7 Offloading encryption, authentication, and server verification to kubectl proxy in an ambassador container
In addition to the two officially supported client libraries (for Go and Python), here's a list of user-contributed client libraries for many other languages:
Java client by Fabric8—https://github.com/fabric8io/kubernetes-client
Java client by Amdatu—https://bitbucket.org/amdatulabs/amdatu-kubernetes
Node.js client by tenxcloud—https://github.com/tenxcloud/node-kubernetes-client
Node.js client by GoDaddy—https://github.com/godaddy/kubernetes-client
PHP—https://github.com/devstub/kubernetes-api-php-client
Another PHP client—https://github.com/maclof/kubernetes-client
Ruby—https://github.com/Ch00k/kubr
Another Ruby client—https://github.com/abonas/kubeclient
Clojure—https://github.com/yanatan16/clj-kubernetes-api
Scala—https://github.com/doriordan/skuber
Perl—https://metacpan.org/pod/Net::Kubernetes
These libraries usually support HTTPS and take care of authentication, so you won’t
need to use the ambassador container.
AN EXAMPLE OF INTERACTING WITH KUBERNETES WITH THE FABRIC8 JAVA CLIENT
To give you a sense of how client libraries enable you to talk to the API server, the following listing shows an example of how to list, create, and modify pods in a Java app using the Fabric8 Kubernetes client.
Listing 8.17 Listing, creating, updating, and deleting pods with the Fabric8 Java client
import java.util.Arrays;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodList;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class Fabric8ClientTest {
  public static void main(String[] args) {
    // the client picks up the cluster configuration automatically
    KubernetesClient client = new DefaultKubernetesClient();

    // list pods in the default namespace
    PodList pods = client.pods().inNamespace("default").list();
    pods.getItems().forEach(
        p -> System.out.println("Found pod: " + p.getMetadata().getName()));

    // create a pod
    System.out.println("Creating a pod");
    Pod pod = client.pods().inNamespace("default")
      .createNew()
      .withNewMetadata()
        .withName("programmatically-created-pod")
      .endMetadata()
      .withNewSpec()
        .addNewContainer()
          .withName("main")
          .withImage("busybox")
          .withCommand(Arrays.asList("sleep", "99999"))
        .endContainer()
      .endSpec()
      .done();
    System.out.println("Created pod: " + pod);

    // edit the pod (add a label to it)
    client.pods().inNamespace("default")
      .withName("programmatically-created-pod")
      .edit()
      .editMetadata()
        .addToLabels("foo", "bar")
      .endMetadata()
      .done();
    System.out.println("Added label foo=bar to pod");
  }
}
The code should be self-explanatory, especially because the Fabric8 client exposes
a nice, fluent Domain-Specific-Language (DSL) API, which is easy to read and
understand.
BUILDING YOUR OWN LIBRARY WITH SWAGGER AND OPENAPI
If no client is available for your programming language of choice, you can use the
Swagger API framework to generate the client library and documentation. The Kuber-
netes API server exposes Swagger API definitions at /swaggerapi and OpenAPI spec at
/swagger.json.
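For example, with the kubectl proxy from section 8.2.1 still running, you can fetch the OpenAPI spec with a plain curl request:
$ curl http://localhost:8001/swagger.json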
To find out more about the Swagger framework, visit the website at http://swagger.io.
EXPLORING THE API WITH SWAGGER UI
Earlier in the chapter I said I’d point you to a better way of exploring the REST API
instead of hitting the REST endpoints with curl. Swagger, which I mentioned in the
previous section, is not just a tool for specifying an API, but also provides a web UI for
exploring REST APIs if they expose the Swagger API definitions. The better way of
exploring REST APIs is through this UI.
Kubernetes not only exposes the Swagger API, but it also has Swagger UI inte-
grated into the API server, though it’s not enabled by default. You can enable it by
running the API server with the --enable-swagger-ui=true option.
TIP If you’re using Minikube, you can enable Swagger UI when starting the
cluster: minikube start --extra-config=apiserver.Features.Enable-
SwaggerUI=true
After you enable the UI, you can open it in your browser by pointing it to:
http(s)://<api server>:<port>/swagger-ui
I urge you to give Swagger UI a try. It not only allows you to browse the Kubernetes
API, but also interact with it (you can POST JSON resource manifests, PATCH resources,
or DELETE them, for example).
8.3 Summary
After reading this chapter, you now know how your app, running inside a pod, can get
data about itself, other pods, and other components deployed in the cluster. You’ve
learned
How a pod’s name, namespace, and other metadata can be exposed to the pro-
cess either through environment variables or files in a downwardAPI volume
How CPU and memory requests and limits are passed to your app in any unit
the app requires
How a pod can use downwardAPI volumes to get up-to-date metadata, which
may change during the lifetime of the pod (such as labels and annotations)
How you can browse the Kubernetes REST API through kubectl proxy
How pods can find the API server’s location through environment variables or
DNS, similar to any other Service defined in Kubernetes
How an application running in a pod can verify that it’s talking to the API
server and how it can authenticate itself
How using an ambassador container can make talking to the API server from
within an app much simpler
How client libraries can get you interacting with Kubernetes in minutes
In this chapter, you learned how to talk to the API server, so the next step is learning
more about how it works. You’ll do that in chapter 11, but before we dive into such
details, you still need to learn about two other Kubernetes resources—Deployments
and StatefulSets. They’re explained in the next two chapters.
Deployments: updating
applications declaratively
You now know how to package your app components into containers, group them
into pods, provide them with temporary or permanent storage, pass both secret
and non-secret config data to them, and allow pods to find and talk to each other.
You know how to run a full-fledged system composed of independently running
smaller components—microservices, if you will. Is there anything else?
Eventually, you’re going to want to update your app. This chapter covers how to
update apps running in a Kubernetes cluster and how Kubernetes helps you move
toward a true zero-downtime update process. Although this can be achieved using
only ReplicationControllers or ReplicaSets, Kubernetes also provides a Deployment
resource that sits on top of ReplicaSets and enables declarative application updates. If
you’re not completely sure what that means, keep reading—it’s not as complicated as
it sounds.
Let's start with the basic setup shown in figure 9.1: a set of pod instances managed by a ReplicationController or a ReplicaSet and exposed to clients through a Service.
Figure 9.1 The basic outline of an application running in Kubernetes
Initially, the pods run the first version of your application—let’s suppose its image is
tagged as v1. You then develop a newer version of the app and push it to an image
repository as a new image, tagged as v2. You’d next like to replace all the pods with
this new version. Because you can’t change an existing pod’s image after the pod is
created, you need to remove the old pods and replace them with new ones running
the new image.
You have two ways of updating all those pods. You can do one of the following:
Delete all existing pods first and then start the new ones.
Start new ones and, once they’re up, delete the old ones. You can do this either
by adding all the new pods and then deleting all the old ones at once, or
sequentially, by adding new pods and removing old ones gradually.
Both these strategies have their benefits and drawbacks. The first option would lead to
a short period of time when your application is unavailable. The second option
requires your app to handle running two versions of the app at the same time. If your
app stores data in a data store, the new version shouldn’t modify the data schema or
the data in such a way that breaks the previous version.
How do you perform these two update methods in Kubernetes? First, let’s look at
how to do this manually; then, once you know what’s involved in the process, you’ll
learn how to have Kubernetes perform the update automatically.
9.1.1 Deleting old pods and replacing them with new ones
You already know how to get a ReplicationController to replace all its pod instances
with pods running a new version. You probably remember the pod template of a
ReplicationController can be updated at any time. When the ReplicationController
creates new instances, it uses the updated pod template to create them.
If you have a ReplicationController managing a set of v1 pods, you can easily
replace them by modifying the pod template so it refers to version v2 of the image and
then deleting the old pod instances. The ReplicationController will notice that no
pods match its label selector and it will spin up new instances. The whole process is
shown in figure 9.2.
Figure 9.2 Updating pods by changing a ReplicationController's pod template and deleting old Pods
This is the simplest way to update a set of pods, if you can accept the short downtime
between the time the old pods are deleted and new ones are started.
9.1.2 Spinning up new pods and then deleting the old ones
If you don’t want to see any downtime and your app supports running multiple ver-
sions at once, you can turn the process around and first spin up all the new pods and
only then delete the old ones. This will require more hardware resources, because
you’ll have double the number of pods running at the same time for a short while.
This is a slightly more complex method compared to the previous one, but you
should be able to do it by combining what you’ve learned about ReplicationControl-
lers and Services so far.
SWITCHING FROM THE OLD TO THE NEW VERSION AT ONCE
Pods are usually fronted by a Service. It’s possible to have the Service front only the
initial version of the pods while you bring up the pods running the new version. Then,
once all the new pods are up, you can change the Service’s label selector and have the
Service switch over to the new pods, as shown in figure 9.3. This is called a blue-green
deployment. After switching over, and once you’re sure the new version functions cor-
rectly, you’re free to delete the old pods by deleting the old ReplicationController.
NOTE You can change a Service’s pod selector with the kubectl set selec-
tor command.
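For example, if the old and new pods carried an additional version label (a hypothetical labeling scheme, not the one used in this chapter's example), switching a Service named kubia over to the new pods might look like this:
$ kubectl set selector svc kubia app=kubia,version=v2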
Figure 9.3 Switching a Service from the old pods to the new ones
RUNNING THE APP AND EXPOSING IT THROUGH A SERVICE USING A SINGLE YAML FILE
To run your app, you’ll create a ReplicationController and a LoadBalancer Service to
enable you to access the app externally. This time, rather than create these two
resources separately, you’ll create a single YAML for both of them and post it to the
Kubernetes API with a single kubectl create command. A YAML manifest can con-
tain multiple objects delimited with a line containing three dashes, as shown in the
following listing.
apiVersion: v1
kind: ReplicationController
metadata:
  name: kubia-v1
spec:
  replicas: 3
  template:
    metadata:
      name: kubia
      labels:
        app: kubia
    spec:
      containers:
      - image: luksa/kubia:v1      # you're creating a ReplicationController for pods running this image
        name: nodejs
---                                # a YAML file can contain multiple resource definitions separated by a line with three dashes
apiVersion: v1
kind: Service
metadata:
  name: kubia
spec:
  type: LoadBalancer
  selector:
    app: kubia                     # the Service fronts all pods created by the ReplicationController
  ports:
  - port: 80
    targetPort: 8080
Listing 9.3 Getting the Service’s external IP and hitting the service in a loop with curl
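A sketch of those two steps (130.211.109.222 is the example external IP used later in this chapter; look yours up with kubectl get svc kubia):
$ kubectl get svc kubia            # read the EXTERNAL-IP column
$ while true; do curl http://130.211.109.222; done
This is v1 running in pod kubia-v1-kbtsk
This is v1 running in pod kubia-v1-2321o
This is v1 running in pod kubia-v1-kbtsk
...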
NOTE If you’re using Minikube or any other Kubernetes cluster where load
balancer services aren’t supported, you can use the Service’s node port to
access the app. This was explained in chapter 5.
This new version is available in the image luksa/kubia:v2 on Docker Hub, so you
don’t need to build it yourself.
Keep the curl loop running and open another terminal, where you’ll get the rolling
update started. To perform the update, you’ll run the kubectl rolling-update com-
mand. All you need to do is tell it which ReplicationController you’re replacing, give a
name for the new ReplicationController, and specify the new image you’d like to
replace the original one with. The following listing shows the full command for per-
forming the rolling update.
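The command takes roughly this form; its output (abridged here) shows kubectl creating the new controller and then gradually scaling the two controllers:
$ kubectl rolling-update kubia-v1 kubia-v2 --image=luksa/kubia:v2
Created kubia-v2
Scaling up kubia-v2 from 0 to 3, scaling down kubia-v1 from 3 to 0
...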
Figure 9.5 The state of the system immediately after starting the rolling update
Listing 9.5 Describing the new ReplicationController created by the rolling update
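A sketch of the output (the value of the deployment label is a hash generated from the new pod template, so it will differ in your cluster):
$ kubectl describe rc kubia-v2
Name:       kubia-v2
Namespace:  default
Image(s):   luksa/kubia:v2
Selector:   app=kubia,deployment=<pod-template-hash>
...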
UNDERSTANDING THE STEPS PERFORMED BY KUBECTL BEFORE THE ROLLING UPDATE COMMENCES
kubectl created this ReplicationController by copying the kubia-v1 controller and
changing the image in its pod template. If you look closely at the controller’s label
selector, you’ll notice it has been modified, too. It includes not only a simple
app=kubia label, but also an additional deployment label which the pods must have in
order to be managed by this ReplicationController.
You probably know this already, but this is necessary to avoid having both the new
and the old ReplicationControllers operating on the same set of pods. But even if pods
created by the new controller have the additional deployment label in addition to the
app=kubia label, doesn’t this mean they’ll be selected by the first ReplicationControl-
ler’s selector, because it’s set to app=kubia?
Yes, that’s exactly what would happen, but there’s a catch. The rolling-update pro-
cess has modified the selector of the first ReplicationController, as well:
$ kubectl describe rc kubia-v1
Name: kubia-v1
Namespace: default
Image(s): luksa/kubia:v1
Selector: app=kubia,deployment=3ddd307978b502a5b975ed4045ae4964-orig
Okay, but doesn’t this mean the first controller now sees zero pods matching its selec-
tor, because the three pods previously created by it contain only the app=kubia label?
No, because kubectl had also modified the labels of the live pods just before modify-
ing the ReplicationController’s selector:
$ kubectl get po --show-labels
NAME READY STATUS RESTARTS AGE LABELS
kubia-v1-m33mv 1/1 Running 0 2m app=kubia,deployment=3ddd...
kubia-v1-nmzw9 1/1 Running 0 2m app=kubia,deployment=3ddd...
kubia-v1-cdtey 1/1 Running 0 2m app=kubia,deployment=3ddd...
If this is getting too complicated, examine figure 9.6, which shows the pods, their
labels, and the two ReplicationControllers, along with their pod selectors.
Figure 9.6 Detailed state of the old and new ReplicationControllers and pods at the start of a rolling
update
kubectl had to do all this before even starting to scale anything up or down. Now
imagine doing the rolling update manually. It’s easy to see yourself making a mistake
here and possibly having the ReplicationController kill off all your pods—pods that
are actively serving your production clients!
REPLACING OLD PODS WITH NEW ONES BY SCALING THE TWO REPLICATIONCONTROLLERS
After setting up all this, kubectl starts replacing pods by first scaling up the new
controller to 1. The controller thus creates the first v2 pod. kubectl then scales
down the old ReplicationController by 1. This is shown in the next two lines printed
by kubectl:
Scaling kubia-v2 up to 1
Scaling kubia-v1 down to 2
Because the Service is targeting all pods with the app=kubia label, you should start see-
ing your curl requests redirected to the new v2 pod every few loop iterations:
This is v2 running in pod kubia-v2-nmzw9
This is v1 running in pod kubia-v1-kbtsk Requests hitting the pod
This is v1 running in pod kubia-v1-2321o running the new version
This is v2 running in pod kubia-v2-nmzw9
...
Figure 9.7 The Service is redirecting requests to both the old and new pods during the rolling update.
As kubectl continues with the rolling update, you start seeing a progressively bigger
percentage of requests hitting v2 pods, as the update process deletes more of the v1
pods and replaces them with those running your new image. Eventually, the original ReplicationController is scaled down to zero and deleted, as the final lines of the kubectl rolling-update output show:
...
Scaling kubia-v2 up to 2
Scaling kubia-v1 down to 1
Scaling kubia-v2 up to 3
Scaling kubia-v1 down to 0
Update succeeded. Deleting kubia-v1
replicationcontroller "kubia-v1" rolling updated to "kubia-v2"
You’re now left with only the kubia-v2 ReplicationController and three v2 pods. All
throughout this update process, you’ve hit your service and gotten a response every
time. You have, in fact, performed a rolling update with zero downtime.
TIP Using the --v 6 option increases the logging level enough to let you see
the requests kubectl is sending to the API server.
Using this option, kubectl will print out each HTTP request it sends to the Kuberne-
tes API server. You’ll see PUT requests to
/api/v1/namespaces/default/replicationcontrollers/kubia-v1, which is the REST URL of the kubia-v1 ReplicationController resource. These requests, which scale the ReplicationControllers up and down, show that it's the kubectl client doing the scaling, instead of it being performed by
the Kubernetes master.
TIP Use the verbose logging option when running other kubectl commands,
to learn more about the communication between kubectl and the API server.
But why is it such a bad thing that the update process is being performed by the client
instead of on the server? Well, in your case, the update went smoothly, but what if you
lost network connectivity while kubectl was performing the update? The update pro-
cess would be interrupted mid-way. Pods and ReplicationControllers would end up in
an intermediate state.
Another reason why performing an update like this isn’t as good as it could be is
because it’s imperative. Throughout this book, I’ve stressed how Kubernetes is about
you telling it the desired state of the system and having Kubernetes achieve that
state on its own, by figuring out the best way to do it. This is how pods are deployed
and how pods are scaled up and down. You never tell Kubernetes to add an addi-
tional pod or remove an excess one—you change the number of desired replicas
and that’s it.
Similarly, you will also want to change the desired image tag in your pod defini-
tions and have Kubernetes replace the pods with new ones running the new image.
This is exactly what drove the introduction of a new resource called a Deployment,
which is now the preferred way of deploying applications in Kubernetes.
You might wonder why you’d want to complicate things by introducing another object
on top of a ReplicationController or ReplicaSet, when they’re what suffices to keep a set
of pod instances running. As the rolling update example in section 9.2 demonstrates,
when updating the app, you need to introduce an additional ReplicationController and
coordinate the two controllers to dance around each other without stepping on each
other’s toes. You need something coordinating this dance. A Deployment resource
takes care of that (it’s not the Deployment resource itself, but the controller process
running in the Kubernetes control plane that does that; but we’ll get to that in chap-
ter 11).
Using a Deployment instead of the lower-level constructs makes updating an app
much easier, because you’re defining the desired state through the single Deployment
resource and letting Kubernetes take care of the rest, as you’ll see in the next few pages.
apiVersion: apps/v1beta1             # Deployments are in the apps API group, version v1beta1
kind: Deployment                     # you've changed the kind from ReplicationController to Deployment
metadata:
  name: kubia                        # there's no need to include the version in the name of the Deployment
spec:
  replicas: 3
  template:
    metadata:
      name: kubia
      labels:
        app: kubia
    spec:
      containers:
      - image: luksa/kubia:v1
        name: nodejs
Because the ReplicationController from before was managing a specific version of the
pods, you called it kubia-v1. A Deployment, on the other hand, is above that version
stuff. At a given point in time, the Deployment can have multiple pod versions run-
ning under its wing, so its name shouldn’t reference the app version.
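With the manifest ready, you can create the Deployment. Assuming you've saved it as kubia-deployment-v1.yaml (the filename here is only an assumption), the command looks like this:
$ kubectl create -f kubia-deployment-v1.yaml --record
deployment "kubia" created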
TIP Be sure to include the --record command-line option when creating it.
This records the command in the revision history, which will be useful later.
DISPLAYING THE STATUS OF THE DEPLOYMENT ROLLOUT
You can use the usual kubectl get deployment and the kubectl describe deployment
commands to see details of the Deployment, but let me point you to an additional
command, which is made specifically for checking a Deployment’s status:
$ kubectl rollout status deployment kubia
deployment kubia successfully rolled out
According to this, the Deployment has been successfully rolled out, so you should see
the three pod replicas up and running. Let’s see:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
kubia-1506449474-otnnh 1/1 Running 0 14s
kubia-1506449474-vmn7s 1/1 Running 0 14s
kubia-1506449474-xis6m 1/1 Running 0 14s
UNDERSTANDING HOW DEPLOYMENTS CREATE REPLICASETS, WHICH THEN CREATE THE PODS
Take note of the names of these pods. Earlier, when you used a ReplicationController
to create pods, their names were composed of the name of the controller plus a ran-
domly generated string (for example, kubia-v1-m33mv). The three pods created by
the Deployment include an additional numeric value in the middle of their names.
What is that exactly?
The number corresponds to the hashed value of the pod template in the Deploy-
ment and the ReplicaSet managing these pods. As we said earlier, a Deployment
doesn’t manage pods directly. Instead, it creates ReplicaSets and leaves the managing
to them, so let’s look at the ReplicaSet created by your Deployment:
$ kubectl get replicasets
NAME DESIRED CURRENT AGE
kubia-1506449474 3 3 10s
The ReplicaSet’s name also contains the hash value of its pod template. As you’ll see
later, a Deployment creates multiple ReplicaSets—one for each version of the pod
template. Using the hash value of the pod template like this allows the Deployment
to always use the same (possibly existing) ReplicaSet for a given version of the pod
template.
ACCESSING THE PODS THROUGH THE SERVICE
With the three replicas created by this ReplicaSet now running, you can use the Ser-
vice you created a while ago to access them, because you made the new pods’ labels
match the Service’s label selector.
Up until this point, you probably haven’t seen a good-enough reason why you should
use Deployments over ReplicationControllers. Luckily, creating a Deployment also hasn’t
been any harder than creating a ReplicationController. Now, you’ll start doing things
with this Deployment, which will make it clear why Deployments are superior. This will
become clear in the next few moments, when you see how updating the app through
a Deployment resource compares to updating it through a ReplicationController.
The alternative Recreate strategy deletes all the old pods at once before creating new ones, which causes a short period of unavailability. The RollingUpdate strategy, on the other hand, removes old pods one by one, while adding new ones at the same time, keeping the application available throughout
the whole process, and ensuring there’s no drop in its capacity to handle requests.
This is the default strategy. The upper and lower limits for the number of pods above
or below the desired replica count are configurable. You should use this strategy only
when your app can handle running both the old and new version at the same time.
SLOWING DOWN THE ROLLING UPDATE FOR DEMO PURPOSES
In the next exercise, you’ll use the RollingUpdate strategy, but you need to slow down
the update process a little, so you can see that the update is indeed performed in a
rolling fashion. You can do that by setting the minReadySeconds attribute on the
Deployment. We’ll explain what this attribute does by the end of this chapter. For
now, set it to 10 seconds with the kubectl patch command.
$ kubectl patch deployment kubia -p '{"spec": {"minReadySeconds": 10}}'
"kubia" patched
TIP The kubectl patch command is useful for modifying a single property
or a limited number of properties of a resource without having to edit its defi-
nition in a text editor.
You used the patch command to change the spec of the Deployment. This doesn’t
cause any kind of update to the pods, because you didn’t change the pod template.
Changing other Deployment properties, like the desired replica count or the deploy-
ment strategy, also doesn’t trigger a rollout, because it doesn’t affect the existing indi-
vidual pods in any way.
TRIGGERING THE ROLLING UPDATE
If you’d like to track the update process as it progresses, first run the curl loop again
in another terminal to see what’s happening with the requests (don’t forget to replace
the IP with the actual external IP of your service):
To trigger the actual rollout, you’ll change the image used in the single pod container
to luksa/kubia:v2. Instead of editing the whole YAML of the Deployment object or
using the patch command to change the image, you’ll use the kubectl set image
command, which allows changing the image of any resource that contains a container
(ReplicationControllers, ReplicaSets, Deployments, and so on). You’ll use it to modify
your Deployment like this:
$ kubectl set image deployment kubia nodejs=luksa/kubia:v2
deployment "kubia" image updated
When you execute this command, you’re updating the kubia Deployment’s pod tem-
plate so the image used in its nodejs container is changed to luksa/kubia:v2 (from
:v1). This is shown in figure 9.9.
266 CHAPTER 9 Deployments: updating applications declaratively
Figure 9.9 The image in the Deployment's pod template is updated from luksa/kubia:v1 to luksa/kubia:v2
kubectl set image isn't the only way to trigger this change; there are several ways of modifying Deployments and other resources:
kubectl edit        Opens the object's manifest in your default editor. After making changes, saving the file, and exiting the editor, the object is updated.
                    Example: kubectl edit deployment kubia

kubectl apply       Modifies the object by applying property values from a full YAML or JSON file. If the object specified in the YAML/JSON doesn't exist yet, it's created. The file needs to contain the full definition of the resource (it can't include only the fields you want to update, as is the case with kubectl patch).
                    Example: kubectl apply -f kubia-deployment-v2.yaml

kubectl replace     Replaces the object with a new one from a YAML/JSON file. In contrast to the apply command, this command requires the object to exist; otherwise it prints an error.
                    Example: kubectl replace -f kubia-deployment-v2.yaml

kubectl set image   Changes the container image defined in a Pod, ReplicationController's template, Deployment, DaemonSet, Job, or ReplicaSet.
                    Example: kubectl set image deployment kubia nodejs=luksa/kubia:v2
All these methods are equivalent as far as Deployments go. What they do is change
the Deployment’s specification. This change then triggers the rollout process.
If you’ve run the curl loop, you’ll see requests initially hitting only the v1 pods; then
more and more of them hit the v2 pods until, finally, all of them hit only the remain-
ing v2 pods, after all v1 pods are deleted. This works much like the rolling update per-
formed by kubectl.
UNDERSTANDING THE AWESOMENESS OF DEPLOYMENTS
Let’s think about what has happened. By changing the pod template in your Deploy-
ment resource, you’ve updated your app to a newer version—by changing a single
field!
The controllers running as part of the Kubernetes control plane then performed
the update. The process wasn’t performed by the kubectl client, like it was when you
used kubectl rolling-update. I don’t know about you, but I think that’s simpler than
having to run a special command telling Kubernetes what to do and then waiting
around for the process to be completed.
The events that occurred below the Deployment’s surface during the update are simi-
lar to what happened during the kubectl rolling-update. An additional ReplicaSet
was created and it was then scaled up slowly, while the previous ReplicaSet was scaled
down to zero (the initial and final states are shown in figure 9.10).
Figure 9.10 A Deployment at the start and at the end of a rolling update: the old (v1) ReplicaSet is scaled down while the new (v2) ReplicaSet is scaled up
You can still see the old ReplicaSet next to the new one if you list them:
$ kubectl get rs
NAME DESIRED CURRENT AGE
kubia-1506449474 0 0 24m
kubia-1581357123 3 3 23m
Similar to ReplicationControllers, all your new pods are now managed by the new
ReplicaSet. Unlike before, the old ReplicaSet is still there, whereas the old Replication-
Controller was deleted at the end of the rolling-update process. You’ll soon see what
the purpose of this inactive ReplicaSet is.
But you shouldn’t care about ReplicaSets here, because you didn’t create them
directly. You created and operated only on the Deployment resource; the underlying
ReplicaSets are an implementation detail. You’ll agree that managing a single Deploy-
ment object is much easier compared to dealing with and keeping track of multiple
ReplicationControllers.
Although this difference may not be so apparent when everything goes well with a
rollout, it becomes much more obvious when you hit a problem during the rollout
process. Let’s simulate one problem right now.
In version 3 of the app, the handler keeps a requestCount variable and increments it on every request. On the fifth and all subsequent requests, the code returns a 500 error with the message "Some internal error has occurred..."
DEPLOYING VERSION 3
I’ve made the v3 version of the image available as luksa/kubia:v3. You’ll deploy this
new version by changing the image in the Deployment specification again:
$ kubectl set image deployment kubia nodejs=luksa/kubia:v3
deployment "kubia" image updated
You can follow the progress of the rollout with kubectl rollout status:
$ kubectl rollout status deployment kubia
Waiting for rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
Waiting for rollout to finish: 1 old replicas are pending termination...
deployment "kubia" successfully rolled out
The new version is now live. As the following listing shows, after a few requests, your
web clients start receiving errors.
UNDOING A ROLLOUT
You can’t have your users experiencing internal server errors, so you need to do some-
thing about it fast. In section 9.3.6 you’ll see how to block bad rollouts automatically,
but for now, let’s see what you can do about your bad rollout manually. Luckily,
Deployments make it easy to roll back to the previously deployed version by telling
Kubernetes to undo the last rollout of a Deployment:
$ kubectl rollout undo deployment kubia
deployment "kubia" rolled back
TIP The undo command can also be used while the rollout process is still in
progress to essentially abort the rollout. Pods already created during the roll-
out process are removed and replaced with the old ones again.
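You can display a Deployment's revision history with the kubectl rollout history command. After the v2 and v3 rollouts, its output looks roughly like this (revision numbers and exact formatting may differ in your cluster):
$ kubectl rollout history deployment kubia
deployments "kubia":
REVISION    CHANGE-CAUSE
2           kubectl set image deployment kubia nodejs=luksa/kubia:v2
3           kubectl set image deployment kubia nodejs=luksa/kubia:v3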
Remember the --record command-line option you used when creating the Deploy-
ment? Without it, the CHANGE-CAUSE column in the revision history would be empty,
making it much harder to figure out what’s behind each revision.
ROLLING BACK TO A SPECIFIC DEPLOYMENT REVISION
You can roll back to a specific revision by specifying the revision in the undo com-
mand. For example, if you want to roll back to the first version, you’d execute the fol-
lowing command:
$ kubectl rollout undo deployment kubia --to-revision=1
Remember the inactive ReplicaSet left over when you modified the Deployment the
first time? The ReplicaSet represents the first revision of your Deployment. All Replica-
Sets created by a Deployment represent the complete revision history, as shown in fig-
ure 9.11. Each ReplicaSet stores the complete information of the Deployment at that
specific revision, so you shouldn’t delete it manually. If you do, you’ll lose that specific
revision from the Deployment’s history, preventing you from rolling back to it.
Figure 9.11 A Deployment's ReplicaSets also act as its revision history
But having old ReplicaSets cluttering your ReplicaSet list is not ideal, so the length of
the revision history is limited by the revisionHistoryLimit property on the Deploy-
ment resource. It defaults to two, so normally only the current and the previous revision
are shown in the history (and only the current and the previous ReplicaSet are pre-
served). Older ReplicaSets are deleted automatically.
maxSurge Determines how many pod instances you allow to exist above the desired replica
count configured on the Deployment. It defaults to 25%, so there can be at most
25% more pod instances than the desired count. If the desired replica count is
set to four, there will never be more than five pod instances running at the same
time during an update. When converting a percentage to an absolute number,
the number is rounded up. Instead of a percentage, the value can also be an
absolute value (for example, one or two additional pods can be allowed).
maxUnavailable Determines how many pod instances can be unavailable relative to the desired
replica count during the update. It also defaults to 25%, so the number of avail-
able pod instances must never fall below 75% of the desired replica count. Here,
when converting a percentage to an absolute number, the number is rounded
down. If the desired replica count is set to four and the percentage is 25%, only
one pod can be unavailable. There will always be at least three pod instances
available to serve requests during the whole rollout. As with maxSurge, you can
also specify an absolute value instead of a percentage.
Because the desired replica count in your case was three, and both these properties
default to 25%, maxSurge allowed the number of all pods to reach four, and
maxUnavailable disallowed having any unavailable pods (in other words, three pods had to be available at all times). This is shown in figure 9.12.
Figure 9.12 Rolling update of a Deployment with three replicas and default maxSurge and maxUnavailable
UNDERSTANDING THE MAXUNAVAILABLE PROPERTY
The extensions/v1beta1 version of Deployments uses different defaults—it sets both
maxSurge and maxUnavailable to 1 instead of 25%. In the case of three replicas, max-
Surge is the same as before, but maxUnavailable is different (1 instead of 0). This
makes the rollout process unwind a bit differently, as shown in figure 9.13.
Figure 9.13 Rolling update of a Deployment with maxSurge=1 and maxUnavailable=1
In this case, one replica can be unavailable, so if the desired replica count is three,
only two of them need to be available. That’s why the rollout process immediately
deletes one pod and creates two new ones. This ensures two pods are available and
that the maximum number of pods isn’t exceeded (the maximum is four in this
case—three plus one from maxSurge). As soon as the two new pods are available, the
two remaining old pods are deleted.
This is a bit hard to grasp, especially since the maxUnavailable property leads you
to believe that that’s the maximum number of unavailable pods that are allowed. If
you look at the previous figure closely, you’ll see two unavailable pods in the second
column even though maxUnavailable is set to 1.
It’s important to keep in mind that maxUnavailable is relative to the desired
replica count. If the replica count is set to three and maxUnavailable is set to one,
that means that the update process must always keep at least two (3 minus 1) pods
available, while the number of pods that aren’t available can exceed one.
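Suppose you now roll out a new version and pause the rollout right after it starts. With the luksa/kubia:v4 image that the rest of this section refers to, that would look something like this:
$ kubectl set image deployment kubia nodejs=luksa/kubia:v4
deployment "kubia" image updated
$ kubectl rollout pause deployment kubia
deployment "kubia" paused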
A single new pod should have been created, but all original pods should also still be
running. Once the new pod is up, a part of all requests to the service will be redirected
to the new pod. This way, you’ve effectively run a canary release. A canary release is a
technique for minimizing the risk of rolling out a bad version of an application and it
affecting all your users. Instead of rolling out the new version to everyone, you replace
only one or a small number of old pods with new ones. This way only a small number
of users will initially hit the new version. You can then verify whether the new version
is working fine or not and then either continue the rollout across all remaining pods
or roll back to the previous version.
RESUMING THE ROLLOUT
In your case, by pausing the rollout process, only a small portion of client requests will
hit your v4 pod, while most will still hit the v3 pods. Once you’re confident the new
version works as it should, you can resume the deployment to replace all the old pods
with new ones:
$ kubectl rollout resume deployment kubia
deployment "kubia" resumed
Obviously, having to pause the deployment at an exact point in the rollout process
isn’t what you want to do. In the future, a new upgrade strategy may do that automati-
cally, but currently, the proper way of performing a canary release is by using two dif-
ferent Deployments and scaling them appropriately.
USING THE PAUSE FEATURE TO PREVENT ROLLOUTS
Pausing a Deployment can also be used to prevent updates to the Deployment from
kicking off the rollout process, allowing you to make multiple changes to the Deploy-
ment and starting the rollout only when you’re done making all the necessary changes.
Once you’re ready for changes to take effect, you resume the Deployment and the
rollout process will start.
NOTE If a Deployment is paused, the undo command won’t undo it until you
resume the Deployment.
Although you should obviously test your pods both in a test and in a staging envi-
ronment before deploying them into production, using minReadySeconds is like an
airbag that saves your app from making a big mess after you’ve already let a buggy ver-
sion slip into production.
With a properly configured readiness probe and a proper minReadySeconds set-
ting, Kubernetes would have prevented us from deploying the buggy v3 version ear-
lier. Let me show you how.
DEFINING A READINESS PROBE TO PREVENT OUR V3 VERSION FROM BEING ROLLED OUT FULLY
You’re going to deploy version v3 again, but this time, you’ll have the proper readi-
ness probe defined on the pod. Your Deployment is currently at version v4, so before
you start, roll back to version v2 again so you can pretend this is the first time you’re
upgrading to v3. If you wish, you can go straight from v4 to v3, but the text that fol-
lows assumes you returned to v2 first.
Unlike before, where you only updated the image in the pod template, you’re now
also going to introduce a readiness probe for the container at the same time. Up until
now, because there was no explicit readiness probe defined, the container and the
pod were always considered ready, even if the app wasn’t truly ready or was returning
errors. There was no way for Kubernetes to know that the app was malfunctioning and
shouldn’t be exposed to clients.
To change the image and introduce the readiness probe at once, you’ll use the
kubectl apply command. You’ll use the following YAML to update the deployment
(you’ll store it as kubia-deployment-v3-with-readinesscheck.yaml), as shown in
the following listing.
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: kubia
spec:
  replicas: 3
  minReadySeconds: 10              # you're keeping minReadySeconds set to 10
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0            # keeping maxUnavailable at 0 makes the deployment replace pods one by one
    type: RollingUpdate
  template:
    metadata:
      name: kubia
      labels:
        app: kubia
    spec:
      containers:
      - image: luksa/kubia:v3
        name: nodejs
        readinessProbe:            # the readiness probe hits the app over HTTP every second (the interval referenced later in the text)
          periodSeconds: 1
          httpGet:
            path: /
            port: 8080
The apply command updates the Deployment with everything that’s defined in the
YAML file. It not only updates the image but also adds the readiness probe definition
and anything else you’ve added or modified in the YAML. If the new YAML also con-
tains the replicas field, which doesn’t match the number of replicas on the existing
Deployment, the apply operation will also scale the Deployment, which isn’t usually
what you want.
TIP To keep the desired replica count unchanged when updating a Deploy-
ment with kubectl apply, don’t include the replicas field in the YAML.
Running the apply command will kick off the update process, which you can again
follow with the rollout status command:
$ kubectl rollout status deployment kubia
Waiting for rollout to finish: 1 out of 3 new replicas have been updated...
Because the status says one new pod has been created, your service should be hitting it
occasionally, right? Let’s see:
$ while true; do curl http://130.211.109.222; done
This is v2 running in pod kubia-1765119474-jvslk
This is v2 running in pod kubia-1765119474-jvslk
This is v2 running in pod kubia-1765119474-xk5g3
This is v2 running in pod kubia-1765119474-pmb26
This is v2 running in pod kubia-1765119474-pmb26
This is v2 running in pod kubia-1765119474-xk5g3
...
Nope, you never hit the v3 pod. Why not? Is it even there? List the pods:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
kubia-1163142519-7ws0i 0/1 Running 0 30s
kubia-1765119474-jvslk 1/1 Running 0 9m
kubia-1765119474-pmb26 1/1 Running 0 9m
kubia-1765119474-xk5g3 1/1 Running 0 8m
Aha! There’s your problem (or as you’ll learn soon, your blessing)! The pod is shown
as not ready, but I guess you’ve been expecting that, right? What has happened?
UNDERSTANDING HOW A READINESS PROBE PREVENTS BAD VERSIONS FROM BEING ROLLED OUT
As soon as your new pod starts, the readiness probe starts being hit every second (you
set the probe’s interval to one second in the pod spec). On the fifth request the readi-
ness probe began failing, because your app starts returning HTTP status code 500
from the fifth request onward.
As a result, the pod is removed as an endpoint from the service (see figure 9.14).
By the time you start hitting the service in the curl loop, the pod has already been
marked as not ready. This explains why you never hit the new pod with curl. And
that’s exactly what you want, because you don’t want clients to hit a pod that’s not
functioning properly.
Figure 9.14 Deployment blocked by a failing readiness probe in the new pod
But what about the rollout process? The rollout status command shows only one
new replica has started. Thankfully, the rollout process will not continue, because the
new pod will never become available. To be considered available, it needs to be ready
for at least 10 seconds. Until it’s available, the rollout process will not create any new
pods, and it also won’t remove any original pods because you’ve set the maxUnavailable
property to 0.
The fact that the deployment is stuck is a good thing, because if it had continued
replacing the old pods with the new ones, you’d end up with a completely non-working
service, like you did when you first rolled out version 3, when you weren’t using the
readiness probe. But now, with the readiness probe in place, there was virtually no
negative impact on your users. A few users may have experienced the internal server
error, but that’s not as big of a problem as if the rollout had replaced all pods with the
faulty version 3.
TIP If you only define the readiness probe without setting minReadySeconds
properly, new pods are considered available immediately when the first invo-
cation of the readiness probe succeeds. If the readiness probe starts failing
shortly after, the bad version is rolled out across all pods. Therefore, you
should set minReadySeconds appropriately.
The time after which the Deployment is considered failed is configurable through the
progressDeadlineSeconds property in the Deployment spec.
NOTE In future versions, the rollout will be aborted automatically when the
time specified in progressDeadlineSeconds is exceeded.
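For example, to give a rollout more time before it's considered failed, you could patch the Deployment the same way you set minReadySeconds earlier (600 seconds is just an example value):
$ kubectl patch deployment kubia -p '{"spec": {"progressDeadlineSeconds": 600}}'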
9.4 Summary
This chapter has shown you how to make your life easier by using a declarative
approach to deploying and updating applications in Kubernetes. Now that you’ve
read this chapter, you should know how to
Perform a rolling update of pods managed by a ReplicationController
Create Deployments instead of lower-level ReplicationControllers or ReplicaSets
Update your pods by editing the pod template in the Deployment specification
Roll back a Deployment either to the previous revision or to any earlier revision
still listed in the revision history
Abort a Deployment mid-way
Pause a Deployment to inspect how a single instance of the new version behaves
in production before allowing additional pod instances to replace the old ones
Control the rate of the rolling update through maxSurge and maxUnavailable
properties
Use minReadySeconds and readiness probes to have the rollout of a faulty ver-
sion blocked automatically
In addition to these Deployment-specific tasks, you also learned how to
Use three dashes as a separator to define multiple resources in a single YAML file
Turn on kubectl’s verbose logging to see exactly what it’s doing behind the
curtains
You now know how to deploy and manage sets of pods created from the same pod
template and thus share the same persistent storage. You even know how to update
them declaratively. But what about running sets of pods, where each instance needs to
use its own persistent storage? We haven’t looked at that yet. That’s the subject of our
next chapter.
StatefulSets:
deploying replicated
stateful applications
You now know how to run both single-instance and replicated stateless pods,
and even stateful pods utilizing persistent storage. You can run several repli-
cated web-server pod instances and you can run a single database pod instance
that uses persistent storage, provided either through plain pod volumes or through
PersistentVolumes bound by a PersistentVolumeClaim. But can you employ a
ReplicaSet to replicate the database pod?
Figure 10.1 All pods from the same ReplicaSet always use the same PersistentVolumeClaim and PersistentVolume.
Because the reference to the claim is in the pod template, which is used to stamp out
multiple pod replicas, you can’t make each replica use its own separate Persistent-
VolumeClaim. You can’t use a ReplicaSet to run a distributed data store, where each
instance needs its own separate storage—at least not by using a single ReplicaSet. To
be honest, none of the API objects you’ve seen so far make running such a data store
possible. You need something else.
couldn’t change the desired replica count—you’d have to create additional Replica-
Sets instead.
Using multiple ReplicaSets is therefore not the best solution. But could you maybe
use a single ReplicaSet and have each pod instance keep its own persistent state, even
though they’re all using the same storage volume?
USING MULTIPLE DIRECTORIES IN THE SAME VOLUME
A trick you can use is to have all pods use the same PersistentVolume, but then have a
separate file directory inside that volume for each pod (this is shown in figure 10.3).
Figure 10.3 Working around the shared storage problem by having the app in each pod use a different file directory
Because you can’t configure pod replicas differently from a single pod template, you
can’t tell each instance what directory it should use, but you can make each instance
automatically select (and possibly also create) a data directory that isn’t being used
by any other instance at that time. This solution does require coordination between
the instances, and isn’t easy to do correctly. It also makes the shared storage volume
the bottleneck.
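To make the coordination problem concrete, here's a deliberately naive sketch (not something the book's examples use) of how an instance might claim an unused data directory at startup. It assumes the directories /data/1 through /data/3 already exist on the shared volume and that mkdir is atomic on the underlying filesystem:

#!/bin/sh
# Naive sketch: claim the first data directory no other instance holds.
# Creating a .lock subdirectory acts as a crude atomic test-and-set.
for i in 1 2 3; do
  if mkdir "/data/$i/.lock" 2>/dev/null; then
    DATA_DIR="/data/$i"
    break
  fi
done
echo "This instance will use $DATA_DIR"

A real implementation would also need to release or expire stale locks when an instance dies, which is exactly the kind of coordination that makes this workaround hard to get right.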
Aside from separate storage, certain applications also require that each instance have a stable network identity, yet pods are routinely killed and replaced with new ones. When a ReplicaSet replaces a pod, the new pod is a completely new pod with a new hostname and IP, although the data in its storage volume may be that of the killed pod. For certain apps, starting up with the old instance's data but with a completely new network identity may cause problems.
Why do certain apps mandate a stable network identity? This requirement is
fairly common in distributed stateful applications. Certain apps require the adminis-
trator to list all the other cluster members and their IP addresses (or hostnames) in
each member’s configuration file. But in Kubernetes, every time a pod is resched-
uled, the new pod gets both a new hostname and a new IP address, so the whole
application cluster would have to be reconfigured every time one of its members is
rescheduled.
USING A DEDICATED SERVICE FOR EACH POD INSTANCE
A trick you can use to work around this problem is to provide a stable network address
for cluster members by creating a dedicated Kubernetes Service for each individual
member. Because service IPs are stable, you can then point to each member through
its service IP (rather than the pod IP) in the configuration.
This is similar to creating a ReplicaSet for each member to provide them with indi-
vidual storage, as described previously. Combining these two techniques results in the
setup shown in figure 10.4 (an additional service covering all the cluster members is
also shown, because you usually need one for clients of the cluster).
Figure 10.4 (diagram): one dedicated Service per pod instance (Service A1, Service A2, ...), plus an additional Service A covering all the cluster members.
The solution is not only ugly, but it still doesn’t solve everything. The individual pods
can’t know which Service they are exposed through (and thus can’t know their stable
IP), so they can’t self-register in other pods using that IP.
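To see what this workaround implies, here's a minimal sketch of one such per-member Service (the Service name and the member label are hypothetical; the book's examples don't actually use this approach). You'd have to repeat this for every member, each time with a different label value:

$ kubectl create -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: service-a-0
spec:
  selector:
    member: a-0        # assumes each member pod is given its own unique label
  ports:
  - port: 80
EOF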
Luckily, Kubernetes saves us from resorting to such complex solutions. The proper
clean and simple way of running these special types of applications in Kubernetes is
through a StatefulSet.
NOTE StatefulSets were initially called PetSets. That name comes from the
pets vs. cattle analogy explained here.
We tend to treat our app instances as pets, where we give each instance a name and
take care of each instance individually. But it’s usually better to treat instances as cattle
and not pay special attention to each individual instance. This makes it easy to replace
unhealthy instances without giving it a second thought, similar to the way a farmer
replaces unhealthy cattle.
Instances of a stateless app, for example, behave much like heads of cattle. It
doesn’t matter if an instance dies—you can create a new instance and people won’t
notice the difference.
On the other hand, with stateful apps, an app instance is more like a pet. When a
pet dies, you can’t go buy a new one and expect people not to notice. To replace a lost
pet, you need to find a new one that looks and behaves exactly like the old one. In the
case of apps, this means the new instance needs to have the same state and identity as
the old one.
COMPARING STATEFULSETS WITH REPLICASETS OR REPLICATIONCONTROLLERS
Pod replicas managed by a ReplicaSet or ReplicationController are much like cattle.
Because they’re mostly stateless, they can be replaced with a completely new pod
replica at any time. Stateful pods require a different approach. When a stateful pod
instance dies (or the node it’s running on fails), the pod instance needs to be resur-
rected on another node, but the new instance needs to get the same name, network
identity, and state as the one it’s replacing. This is what happens when the pods are
managed through a StatefulSet.
A StatefulSet makes sure pods are rescheduled in such a way that they retain their
identity and state. It also allows you to easily scale the number of pets up and down. A
StatefulSet, like a ReplicaSet, has a desired replica count field that determines how
many pets you want running at that time. Similar to ReplicaSets, pods are created from
a pod template specified as part of the StatefulSet (remember the cookie-cutter anal-
ogy?). But unlike pods created by ReplicaSets, pods created by the StatefulSet aren’t
exact replicas of each other. Each can have its own set of volumes—in other words,
storage (and thus persistent state)—which differentiates it from its peers. Pet pods
also have a predictable (and stable) identity instead of each new pod instance getting
a completely random one.
Figure 10.5 Pods created by a StatefulSet have predictable names (and hostnames),
unlike those created by a ReplicaSet
For example, if the governing headless Service is in the default namespace and is called foo, and one of the pods is called A-0, you can reach the pod through its fully qualified domain name, which is a-0.foo.default.svc.cluster.local. You can't do that with pods managed by a ReplicaSet.
Additionally, you can also use DNS to look up all the StatefulSet’s pods’ names by
looking up SRV records for the foo.default.svc.cluster.local domain. We’ll
explain SRV records in section 10.4 and learn how they’re used to discover members
of a StatefulSet.
REPLACING LOST PETS
When a pod instance managed by a StatefulSet disappears (because the node the pod
was running on has failed, it was evicted from the node, or someone deleted the pod
object manually), the StatefulSet makes sure it’s replaced with a new instance—similar
to how ReplicaSets do it. But in contrast to ReplicaSets, the replacement pod gets the
same name and hostname as the pod that has disappeared (this distinction between
ReplicaSets and StatefulSets is illustrated in figure 10.6).
Figure 10.6 A StatefulSet replaces a lost pod with a new one with the same identity, whereas a ReplicaSet replaces it with a completely new, unrelated pod.
The new pod isn’t necessarily scheduled to the same node, but as you learned early
on, what node a pod runs on shouldn’t matter. This holds true even for stateful pods.
Even if the pod is scheduled to a different node, it will still be available and reachable
under the same hostname as before.
SCALING A STATEFULSET
Scaling the StatefulSet creates a new pod instance with the next unused ordinal index.
If you scale up from two to three instances, the new instance will get index 2 (the exist-
ing instances obviously have indexes 0 and 1).
The nice thing about scaling down a StatefulSet is the fact that you always know
what pod will be removed. Again, this is also in contrast to scaling down a ReplicaSet,
where you have no idea what instance will be deleted, and you can’t even specify
which one you want removed first (but this feature may be introduced in the future).
Scaling down a StatefulSet always removes the instances with the highest ordinal index
first (shown in figure 10.7). This makes the effects of a scale-down predictable.
Figure 10.7 Scaling down a StatefulSet always removes the pod with the highest ordinal index first.
Because certain stateful applications don’t handle rapid scale-downs nicely, Stateful-
Sets scale down only one pod instance at a time. A distributed data store, for example,
may lose data if multiple nodes go down at the same time. For example, if a replicated
data store is configured to store two copies of each data entry, in cases where two
nodes go down at the same time, a data entry would be lost if it was stored on exactly
those two nodes. If the scale-down was sequential, the distributed data store has time
to create an additional replica of the data entry somewhere else to replace the (single)
lost copy.
For this exact reason, StatefulSets also never permit scale-down operations if any of
the instances are unhealthy. If an instance is unhealthy, and you scale down by one at
the same time, you’ve effectively lost two cluster members at once.
Obviously, storage for stateful pods needs to be persistent and decoupled from
the pods. In chapter 6 you learned about PersistentVolumes and PersistentVolume-
Claims, which allow persistent storage to be attached to a pod by referencing the
PersistentVolumeClaim in the pod by name. Because PersistentVolumeClaims map
to PersistentVolumes one-to-one, each pod of a StatefulSet needs to reference a dif-
ferent PersistentVolumeClaim to have its own separate PersistentVolume. Because
all pod instances are stamped from the same pod template, how can they each refer
to a different PersistentVolumeClaim? And who creates these claims? Surely you’re
not expected to create as many PersistentVolumeClaims as the number of pods you
plan to have in the StatefulSet upfront? Of course not.
TEAMING UP POD TEMPLATES WITH VOLUME CLAIM TEMPLATES
The StatefulSet has to create the PersistentVolumeClaims as well, the same way it’s cre-
ating the pods. For this reason, a StatefulSet can also have one or more volume claim
templates, which enable it to stamp out PersistentVolumeClaims along with each pod
instance (see figure 10.8).
Figure 10.8 (diagram): a StatefulSet's volume claim template is used to stamp out a PersistentVolumeClaim along with each pod (for example, PVC A-2 for pod A-2), and each claim binds to its own PersistentVolume.
The PersistentVolumes for the claims can either be provisioned up-front by an admin-
istrator or just in time through dynamic provisioning of PersistentVolumes, as explained
at the end of chapter 6.
UNDERSTANDING THE CREATION AND DELETION OF PERSISTENTVOLUMECLAIMS
Scaling up a StatefulSet by one creates two or more API objects (the pod and one or
more PersistentVolumeClaims referenced by the pod). Scaling down, however, deletes
only the pod, leaving the claims alone. The reason for this is obvious, if you consider
what happens when a claim is deleted. After a claim is deleted, the PersistentVolume it
was bound to gets recycled or deleted and its contents are lost.
Because stateful pods are meant to run stateful applications, which implies that the
data they store in the volume is important, deleting the claim on scale-down of a Stateful-
Set could be catastrophic—especially since triggering a scale-down is as simple as
decreasing the replicas field of the StatefulSet. For this reason, you’re required to
delete PersistentVolumeClaims manually to release the underlying PersistentVolume.
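For example, after scaling down the StatefulSet you'll create later in this chapter, releasing the storage of the removed instance would require deleting its claim by hand. Claims created from a volume claim template are named <template name>-<pod name>, so the claim belonging to pod kubia-2 would be called data-kubia-2:

$ kubectl delete pvc data-kubia-2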
Figure 10.9 StatefulSets don't delete PersistentVolumeClaims when scaling down; then they reattach them when scaling back up.
If Kubernetes created a replacement stateful pod before being certain the original had stopped running, both would be bound to the same PersistentVolumeClaim, so two processes with the same identity would be writing over the same files. With pods managed by a ReplicaSet, this isn't a problem, because the apps are obviously made to work on the same files. Also, ReplicaSets create pods with a randomly generated identity, so there's no way for two processes to run with the same identity.
INTRODUCING STATEFULSET’S AT-MOST-ONE SEMANTICS
Kubernetes must thus take great care to ensure two stateful pod instances are never
running with the same identity and are bound to the same PersistentVolumeClaim. A
StatefulSet must guarantee at-most-one semantics for stateful pod instances.
This means a StatefulSet must be absolutely certain that a pod is no longer run-
ning before it can create a replacement pod. This has a big effect on how node fail-
ures are handled. We’ll demonstrate this later in the chapter. Before we can do that,
however, you need to create a StatefulSet and see how it behaves. You’ll also learn a
few more things about them along the way.
...
const dataFile = "/var/data/kubia.txt";
...
var handler = function(request, response) {
  if (request.method == 'POST') {
    // On POST requests, store the request's body into the data file.
    var file = fs.createWriteStream(dataFile);
    file.on('open', function (fd) {
      request.pipe(file);
      console.log("New data has been received and stored.");
      response.writeHead(200);
      response.end("Data stored on pod " + os.hostname() + "\n");
    });
  } else {
    // On GET (and all other types of) requests, return the hostname
    // and the contents of the data file.
    var data = fileExists(dataFile)
      ? fs.readFileSync(dataFile, 'utf8')
      : "No data posted yet";
    response.writeHead(200);
    response.write("You've hit " + os.hostname() + "\n");
    response.end("Data stored on this pod: " + data + "\n");
  }
};
Whenever the app receives a POST request, it writes the data it receives in the body of
the request to the file /var/data/kubia.txt. Upon a GET request, it returns the host-
name and the stored data (contents of the file). Simple enough, right? This is the first
version of your app. It’s not clustered yet, but it’s enough to get you started. You’ll
expand the app later in the chapter.
The Dockerfile for building the container image is shown in the following listing
and hasn’t changed from before.
FROM node:7
ADD app.js /app.js
ENTRYPOINT ["node", "app.js"]
Go ahead and build the image now, or use the one I pushed to docker.io/luksa/kubia-pet.
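If you'd rather build it yourself, the steps are the usual ones, assuming you're in the directory containing app.js and the Dockerfile (replace luksa with your own Docker Hub ID):

$ docker build -t luksa/kubia-pet .
$ docker push luksa/kubia-pet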
For each pod instance, the StatefulSet will create a PersistentVolumeClaim that will
bind to a PersistentVolume. If your cluster supports dynamic provisioning, you don’t
need to create any PersistentVolumes manually (you can skip the next section). If it
doesn’t, you’ll need to create them as explained in the next section.
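If you're not sure whether your cluster supports dynamic provisioning, you can check whether any StorageClass objects are defined (and whether one of them is marked as the default):

$ kubectl get storageclass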
CREATING THE PERSISTENT VOLUMES
You’ll need three PersistentVolumes, because you’ll be scaling the StatefulSet up to
three replicas. You must create more if you plan on scaling the StatefulSet up more
than that.
If you’re using Minikube, deploy the PersistentVolumes defined in the Chapter06/
persistent-volumes-hostpath.yaml file in the book’s code archive.
If you’re using Google Kubernetes Engine, you’ll first need to create the actual
GCE Persistent Disks like this:
$ gcloud compute disks create --size=1GiB --zone=europe-west1-b pv-a
$ gcloud compute disks create --size=1GiB --zone=europe-west1-b pv-b
$ gcloud compute disks create --size=1GiB --zone=europe-west1-b pv-c
NOTE Make sure to create the disks in the same zone that your nodes are
running in.
kind: List
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: pv-a                               # The three volumes are named pv-a, pv-b, and pv-c.
  spec:
    capacity:
      storage: 1Mi                           # The capacity of each volume is 1 mebibyte.
    accessModes:
    - ReadWriteOnce
    persistentVolumeReclaimPolicy: Recycle   # When released by its claim, the volume is recycled to be used again.
    gcePersistentDisk:                       # The volume uses a GCE Persistent Disk as the underlying storage mechanism.
      pdName: pv-a
      fsType: ext4
- apiVersion: v1
  kind: PersistentVolume
  metadata:
    name: pv-b
  ...
NOTE In the previous chapter you specified multiple resources in the same
YAML by delimiting them with a three-dash line. Here you’re using a differ-
ent approach by defining a List object and listing the resources as items of
the object. Both methods are equivalent.
This manifest creates PersistentVolumes called pv-a, pv-b, and pv-c. They use GCE Per-
sistent Disks as the underlying storage mechanism, so they’re not appropriate for clus-
ters that aren’t running on Google Kubernetes Engine or Google Compute Engine. If
you’re running the cluster elsewhere, you must modify the PersistentVolume definition
and use an appropriate volume type, such as NFS (Network File System), or similar.
CREATING THE GOVERNING SERVICE
As explained earlier, before deploying a StatefulSet, you first need to create a headless
Service, which will be used to provide the network identity for your stateful pods. The
following listing shows the Service manifest.
apiVersion: v1
kind: Service
metadata:
  name: kubia           # The name of the Service
spec:
  clusterIP: None       # The StatefulSet's governing Service must be headless.
You’re setting the clusterIP field to None, which makes this a headless Service. It will
enable peer discovery between your pods (you’ll need this later). Once you create the
Service, you can move on to creating the actual StatefulSet.
CREATING THE STATEFULSET MANIFEST
Now you can finally create the StatefulSet. The following listing shows the manifest.
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: kubia
spec:
  serviceName: kubia
  replicas: 2
  template:
    metadata:
      labels:
        app: kubia                 # Pods created by the StatefulSet will have the app=kubia label.
    spec:
      containers:
      - name: kubia
        image: luksa/kubia-pet
        ports:
        - name: http
          containerPort: 8080
        volumeMounts:
        - name: data
          mountPath: /var/data     # The container inside the pod will mount the pvc volume at this path.
  volumeClaimTemplates:            # The PersistentVolumeClaims will be created from this template.
  - metadata:
      name: data
    spec:
      resources:
        requests:
          storage: 1Mi
      accessModes:
      - ReadWriteOnce
The StatefulSet manifest isn’t that different from ReplicaSet or Deployment manifests
you’ve created so far. What’s new is the volumeClaimTemplates list. In it, you’re defin-
ing one volume claim template called data, which will be used to create a Persistent-
VolumeClaim for each pod. As you may remember from chapter 6, a pod references a
claim by including a persistentVolumeClaim volume in the manifest. In the previous
pod template, you’ll find no such volume. The StatefulSet adds it to the pod specifica-
tion automatically and configures the volume to be bound to the claim the StatefulSet
created for the specific pod.
CREATING THE STATEFULSET
You’ll create the StatefulSet now:
$ kubectl create -f kubia-statefulset.yaml
statefulset "kubia" created
List the pods a few moments after creating the StatefulSet:
$ kubectl get po
You'll see that the first pod is now running, while the second one has only just been created and is being started. The StatefulSet creates its pods one at a time, proceeding to the next pod only after the previous one is up.
EXAMINING THE GENERATED STATEFUL POD
Let’s take a closer look at the first pod’s spec in the following listing to see how the
StatefulSet has constructed the pod from the pod template and the PersistentVolume-
Claim template.
You can communicate with an individual stateful pod directly through the API server, using a URL of this form:
<apiServerHost>:<port>/api/v1/namespaces/default/pods/kubia-0/proxy/<path>
Because the API server is secured, sending requests to pods through the API server is
cumbersome (among other things, you need to pass the authorization token in each
request). Luckily, in chapter 8 you learned how to use kubectl proxy to talk to the
API server without having to deal with authentication and SSL certificates. Run the
proxy again:
$ kubectl proxy
Starting to serve on 127.0.0.1:8001
Now, because you’ll be talking to the API server through the kubectl proxy, you’ll use
localhost:8001 rather than the actual API server host and port. You’ll send a request to
the kubia-0 pod like this:
$ curl localhost:8001/api/v1/namespaces/default/pods/kubia-0/proxy/
You've hit kubia-0
Data stored on this pod: No data posted yet
The response shows that the request was indeed received and handled by the app run-
ning in your pod kubia-0.
NOTE If you receive an empty response, make sure you haven’t left out that
last slash character at the end of the URL (or make sure curl follows redirects
by using its -L option).
Because you’re communicating with the pod through the API server, which you’re
connecting to through the kubectl proxy, the request went through two different
proxies (the first was the kubectl proxy and the other was the API server, which prox-
ied the request to the pod). For a clearer picture, examine figure 10.10.
Figure 10.10 Connecting to a pod through both the kubectl proxy and API server proxy: GET localhost:8001/api/v1/namespaces/default/pods/kubia-0/proxy/ goes to the kubectl proxy, which forwards it to the API server as GET 192.168.99.100:8443/api/v1/namespaces/default/pods/kubia-0/proxy/ with an Authorization: Bearer <token> header, and the API server in turn proxies it to the pod as GET 172.17.0.3:8080/.
The request you sent to the pod was a GET request, but you can also send POST
requests through the API server. This is done by sending a POST request to the same
proxy URL as the one you sent the GET request to.
When your app receives a POST request, it stores whatever’s in the request body
into a local file. Send a POST request to the kubia-0 pod:
$ curl -X POST -d "Hey there! This greeting was submitted to kubia-0."
➥ localhost:8001/api/v1/namespaces/default/pods/kubia-0/proxy/
Data stored on pod kubia-0
The data you sent should now be stored in that pod. Let’s see if it returns the stored
data when you perform a GET request again:
$ curl localhost:8001/api/v1/namespaces/default/pods/kubia-0/proxy/
You've hit kubia-0
Data stored on this pod: Hey there! This greeting was submitted to kubia-0.
Okay, so far so good. Now let’s see what the other cluster node (the kubia-1 pod)
says:
$ curl localhost:8001/api/v1/namespaces/default/pods/kubia-1/proxy/
You've hit kubia-1
Data stored on this pod: No data posted yet
As expected, each node has its own state. But is that state persisted? Let’s find out.
DELETING A STATEFUL POD TO SEE IF THE RESCHEDULED POD IS REATTACHED TO THE SAME STORAGE
You’re going to delete the kubia-0 pod and wait for it to be rescheduled. Then you’ll
see if it's still serving the same data as before:
$ kubectl delete po kubia-0
pod "kubia-0" deleted
If you list the pods, you'll see that the pod is terminating:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
kubia-0 1/1 Terminating 0 3m
kubia-1 1/1 Running 0 3m
As soon as it terminates successfully, a new pod with the same name is created by the
StatefulSet:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
kubia-0 0/1 ContainerCreating 0 6s
kubia-1 1/1 Running 0 4m
$ kubectl get po
NAME READY STATUS RESTARTS AGE
kubia-0 1/1 Running 0 9s
kubia-1 1/1 Running 0 4m
Let me remind you again that this new pod may be scheduled to any node in the clus-
ter, not necessarily the same node that the old pod was scheduled to. The old pod’s
whole identity (the name, hostname, and the storage) is effectively moved to the new
node (as shown in figure 10.11). If you’re using Minikube, you can’t see this because it
only runs a single node, but in a multi-node cluster, you may see the pod scheduled to
a different node than before.
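In a multi-node cluster, you can confirm which node the new pod landed on by requesting the wide output of the pod list, which includes a NODE column:

$ kubectl get po -o wide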
Figure 10.11 A stateful pod may be rescheduled to a different node, but it retains the name, hostname, and storage.
With the new pod now running, let’s check to see if it has the exact same identity as in
its previous incarnation. The pod’s name is the same, but what about the hostname
and persistent data? You can ask the pod itself to confirm:
$ curl localhost:8001/api/v1/namespaces/default/pods/kubia-0/proxy/
You've hit kubia-0
Data stored on this pod: Hey there! This greeting was submitted to kubia-0.
The pod’s response shows that both the hostname and the data are the same as before,
confirming that a StatefulSet always replaces a deleted pod with what’s effectively the
exact same pod.
SCALING A STATEFULSET
Scaling down a StatefulSet and scaling it back up after an extended time period
should be no different than deleting a pod and having the StatefulSet recreate it
immediately. Remember that scaling down a StatefulSet only deletes the pods, but
leaves the PersistentVolumeClaims untouched. I’ll let you try scaling down the State-
fulSet yourself and confirm this behavior.
The key thing to remember is that scaling down (and up) is performed gradu-
ally—similar to how individual pods are created when the StatefulSet is created ini-
tially. When scaling down by more than one instance, the pod with the highest ordinal
number is deleted first. Only after the pod terminates completely is the pod with the
second highest ordinal number deleted.
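Here's a sketch of how you might try it, assuming your kubectl version supports scaling StatefulSets directly (if it doesn't, change the replicas field with kubectl edit instead):

$ kubectl scale statefulset kubia --replicas=1
$ kubectl get po        # kubia-1 is deleted first; only kubia-0 remains
$ kubectl get pvc       # both claims, data-kubia-0 and data-kubia-1, are still there
$ kubectl scale statefulset kubia --replicas=2
$ kubectl get po        # kubia-1 comes back and reattaches to data-kubia-1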
EXPOSING STATEFUL PODS THROUGH A REGULAR, NON-HEADLESS SERVICE
Before you move on to the last part of this chapter, you’re going to add a proper, non-
headless Service in front of your pods, because clients usually connect to the pods
through a Service rather than connecting directly.
You know how to create the Service by now, but in case you don’t, the following list-
ing shows the manifest.
Listing 10.7 A regular Service for accessing the stateful pods: kubia-service-public.yaml
apiVersion: v1
kind: Service
metadata:
name: kubia-public
spec:
selector:
app: kubia
ports:
- port: 80
targetPort: 8080
Because this isn’t an externally exposed Service (it’s a regular ClusterIP Service, not
a NodePort or a LoadBalancer-type Service), you can only access it from inside the
cluster. You’ll need a pod to access it from, right? Not necessarily.
CONNECTING TO CLUSTER-INTERNAL SERVICES THROUGH THE API SERVER
Instead of using a piggyback pod to access the service from inside the cluster, you can
use the same proxy feature provided by the API server to access the service the way
you’ve accessed individual pods.
The URI path for proxy-ing requests to Services is formed like this:
/api/v1/namespaces/<namespace>/services/<service name>/proxy/<path>
Therefore, you can run curl on your local machine and access the service through the
kubectl proxy like this (you ran kubectl proxy earlier and it should still be running):
$ curl localhost:8001/api/v1/namespaces/default/services/kubia-
➥ public/proxy/
You've hit kubia-1
Data stored on this pod: No data posted yet
Likewise, clients (inside the cluster) can use the kubia-public service for storing to
and reading data from your clustered data store. Of course, each request lands on a
random cluster node, so you’ll get the data from a random node each time. You’ll
improve this next.
How can a pod discover its peers without talking to the API? Is there an existing,
well-known technology you can use that makes this possible? How about the Domain
Name System (DNS)? Depending on how much you know about DNS, you probably
understand what an A, CNAME, or MX record is used for. Other lesser-known types of
DNS records also exist. One of them is the SRV record.
INTRODUCING SRV RECORDS
SRV records are used to point to hostnames and ports of servers providing a specific
service. Kubernetes creates SRV records to point to the hostnames of the pods back-
ing a headless service.
You’re going to list the SRV records for your stateful pods by running the dig DNS
lookup tool inside a new temporary pod. This is the command you’ll use:
$ kubectl run -it srvlookup --image=tutum/dnsutils --rm
➥ --restart=Never -- dig SRV kubia.default.svc.cluster.local
...
;; ANSWER SECTION:
k.d.s.c.l. 30 IN SRV 10 33 0 kubia-0.kubia.default.svc.cluster.local.
k.d.s.c.l. 30 IN SRV 10 33 0 kubia-1.kubia.default.svc.cluster.local.
;; ADDITIONAL SECTION:
kubia-0.kubia.default.svc.cluster.local. 30 IN A 172.17.0.4
kubia-1.kubia.default.svc.cluster.local. 30 IN A 172.17.0.6
...
NOTE I've had to shorten the actual names to get the records to fit on a single line, so k.d.s.c.l actually stands for kubia.default.svc.cluster.local.
The ANSWER SECTION shows two SRV records pointing to the two pods backing your head-
less service. Each pod also gets its own A record, as shown in ADDITIONAL SECTION.
For a pod to get a list of all the other pods of a StatefulSet, all you need to do is
perform an SRV DNS lookup. In Node.js, for example, the lookup is performed
like this:
dns.resolveSrv("kubia.default.svc.cluster.local", callBackFunction);
You’ll use this command in your app to enable each pod to discover its peers.
NOTE The order of the returned SRV records is random, because they all have
the same priority. Don’t expect to always see kubia-0 listed before kubia-1.
...
const dns = require('dns');
...
        addresses.forEach(function (item) {
          // Each pod pointed to by an SRV record is then contacted to get its data.
          var requestOptions = {
            host: item.name,
            port: port,
            path: '/data'
          };
          httpGet(requestOptions, function (returnedData) {
            numResponses++;
            response.write("- " + item.name + ": " + returnedData);
            response.write("\n");
            if (numResponses == addresses.length) {
              response.end();
            }
          });
        });
      }
    });
  }
}
};
...
Figure 10.12 shows what happens when a GET request is received by your app. The
server that receives the request first performs a lookup of SRV records for the head-
less kubia service and then sends a GET request to each of the pods backing the ser-
vice (even to itself, which obviously isn’t necessary, but I wanted to keep the code as
simple as possible). It then returns a list of all the nodes along with the data stored on
each of them.
Figure 10.12 (diagram): curl sends GET / to one of the pods; that pod looks up its peers through DNS, sends GET /data to each of kubia-0, kubia-1, and kubia-2, and returns the collated data.
The container image containing this new version of the app is available at docker.io/
luksa/kubia-pet-peers.
To update the StatefulSet, use the kubectl edit command (the patch command would be another option):
$ kubectl edit statefulset kubia
This opens the StatefulSet definition in your default editor. In the definition, change
spec.replicas to 3 and modify the spec.template.spec.containers.image attri-
bute so it points to the new image (luksa/kubia-pet-peers instead of luksa/kubia-
pet). Save the file and exit the editor to update the StatefulSet. Two replicas were
running previously, so you should now see an additional replica called kubia-2 start-
ing. List the pods to confirm:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
kubia-0 1/1 Running 0 25m
kubia-1 1/1 Running 0 26m
kubia-2 0/1 ContainerCreating 0 4s
The new pod instance is running the new image. But what about the existing two rep-
licas? Judging from their age, they don’t seem to have been updated. This is expected,
because initially, StatefulSets were more like ReplicaSets and not like Deployments,
so they don’t perform a rollout when the template is modified. You need to delete
the replicas manually and the StatefulSet will bring them up again based on the new
template:
$ kubectl delete po kubia-0 kubia-1
pod "kubia-0" deleted
pod "kubia-1" deleted
Once the replacement pods are up and running the new image, you can try writing to the clustered data store through the kubia-public Service.
Listing 10.10 Writing to the clustered data store through the service
$ curl -X POST -d "The sun is shining"
➥ localhost:8001/api/v1/namespaces/default/services/kubia-public/proxy/
Data stored on pod kubia-1
$ curl -X POST -d "The weather is sweet"
➥ localhost:8001/api/v1/namespaces/default/services/kubia-public/proxy/
Data stored on pod kubia-0
Now, read the stored data back through the service:
$ curl localhost:8001/api/v1/namespaces/default/services
➥ /kubia-public/proxy/
You've hit kubia-2
Data stored on each cluster node:
- kubia-0.kubia.default.svc.cluster.local: The weather is sweet
- kubia-1.kubia.default.svc.cluster.local: The sun is shining
- kubia-2.kubia.default.svc.cluster.local: No data posted yet
Nice! When a client request reaches one of your cluster nodes, it discovers all its
peers, gathers data from them, and sends all the data back to the client. Even if you
scale the StatefulSet up or down, the pod servicing the client’s request can always find
all the peers running at that time.
The app itself isn’t that useful, but I hope you found it a fun way to show how
instances of a replicated stateful app can discover their peers and handle horizontal
scaling with ease.
Your ssh session will stop working, so you’ll need to open another terminal to continue.
CHECKING THE NODE’S STATUS AS SEEN BY THE KUBERNETES MASTER
With the node’s network interface down, the Kubelet running on the node can no
longer contact the Kubernetes API server and let it know that the node and all its pods
are still running.
After a while, the control plane will mark the node as NotReady. You can see this
when listing nodes, as the following listing shows.
Because the control plane is no longer getting status updates from the node, the
status of all pods on that node is Unknown. This is shown in the pod list in the follow-
ing listing.
Listing 10.13 Observing the pod’s status change after its node becomes NotReady
$ kubectl get po
NAME READY STATUS RESTARTS AGE
kubia-0 1/1 Unknown 0 15m
kubia-1 1/1 Running 0 14m
kubia-2 1/1 Running 0 13m
As you can see, the kubia-0 pod’s status is no longer known because the pod was (and
still is) running on the node whose network interface you shut down.
UNDERSTANDING WHAT HAPPENS TO PODS WHOSE STATUS IS UNKNOWN
If the node were to come back online and report its and its pod statuses again, the pod
would again be marked as Running. But if the pod’s status remains unknown for more
than a few minutes (this time is configurable), the pod is automatically evicted from
the node. This is done by the master (the Kubernetes control plane). It evicts the pod
by deleting the pod resource.
When the Kubelet sees that the pod has been marked for deletion, it starts ter-
minating the pod. In your case, the Kubelet can no longer reach the master (because
you disconnected the node from the network), which means the pod will keep
running.
Let’s examine the current situation. Use kubectl describe to display details about
the kubia-0 pod, as shown in the following listing.
Listing 10.14 Displaying details of the pod with the unknown status
The pod is shown as Terminating, with NodeLost listed as the reason for the termina-
tion. The message says the node is considered lost because it’s unresponsive.
NOTE What’s shown here is the control plane’s view of the world. In reality,
the pod’s container is still running perfectly fine. It isn’t terminating at all.
All done, right? By deleting the pod, the StatefulSet should immediately create a
replacement pod, which will get scheduled to one of the remaining nodes. List the
pods again to confirm:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
kubia-0 1/1 Unknown 0 15m
kubia-1 1/1 Running 0 14m
kubia-2 1/1 Running 0 13m
That’s strange. You deleted the pod a moment ago and kubectl said it had deleted it.
Why is the same pod still there?
NOTE The kubia-0 pod in the listing isn’t a new pod with the same name—
this is clear by looking at the AGE column. If it were new, its age would be
merely a few seconds.
To get rid of it, you need to delete the pod forcibly:
$ kubectl delete po kubia-0 --force --grace-period 0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "kubia-0" deleted
You need to use both the --force and --grace-period 0 options. The warning displayed by kubectl notifies you of what you did. If you list the pods again, you'll finally see a new kubia-0 pod created:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
kubia-0 0/1 ContainerCreating 0 8s
kubia-1 1/1 Running 0 20m
kubia-2 1/1 Running 0 19m
WARNING Don’t delete stateful pods forcibly unless you know the node is no
longer running or is unreachable (and will remain so forever).
Before continuing, you may want to bring the node you disconnected back online.
You can do that by restarting the node through the GCE web console or in a terminal
by issuing the following command:
$ gcloud compute instances reset <node name>
10.6 Summary
This concludes the chapter on using StatefulSets to deploy stateful apps. This chapter
has shown you how to
■ Give replicated pods individual storage
■ Provide a stable identity to a pod
■ Create a StatefulSet and a corresponding headless governing Service
■ Scale and update a StatefulSet
■ Discover other members of the StatefulSet through DNS
Now that you know the major building blocks you can use to have Kubernetes run and
manage your apps, we can look more closely at how it does that. In the next chapter,
you’ll learn about the individual components that control the Kubernetes cluster and
keep your apps running.
Understanding
Kubernetes internals
By reading this book up to this point, you’ve become familiar with what Kubernetes
has to offer and what it does. But so far, I’ve intentionally not spent much time
explaining exactly how it does all this because, in my opinion, it makes no sense to
go into details of how a system works until you have a good understanding of what
the system does. That’s why we haven’t talked about exactly how a pod is scheduled
or how the various controllers running inside the Controller Manager make deployed
resources come to life. Because you now know most resources that can be deployed in
Kubernetes, it’s time to dive into how they’re implemented.
Let’s look more closely at what these two parts do and what’s running inside them.
COMPONENTS OF THE CONTROL PLANE
The Control Plane is what controls and makes the whole cluster function. To refresh
your memory, the components that make up the Control Plane are
■ The etcd distributed persistent storage
■ The API server
■ The Scheduler
■ The Controller Manager
These components store and manage the state of the cluster, but they aren’t what runs
the application containers.
COMPONENTS RUNNING ON THE WORKER NODES
The task of running your containers is up to the components running on each
worker node:
■ The Kubelet
■ The Kubernetes Service Proxy (kube-proxy)
■ The Container Runtime (Docker, rkt, or others)
ADD-ON COMPONENTS
Beside the Control Plane components and the components running on the nodes, a
few add-on components are required for the cluster to provide everything discussed
so far. This includes
■ The Kubernetes DNS server
■ The Dashboard
■ An Ingress controller
■ Heapster, which we'll talk about in chapter 14
■ The Container Network Interface network plugin (we'll explain it later in this chapter)
Figure 11.1 Kubernetes components of the Control Plane and the worker nodes
To get all the features Kubernetes provides, all these components need to be running.
But several can also perform useful work individually without the other components.
You’ll see how as we examine each of them.
can be more than one instance of each Control Plane component running to ensure
high availability. While multiple instances of etcd and API server can be active at the
same time and do perform their jobs in parallel, only a single instance of the Sched-
uler and the Controller Manager may be active at a given time—with the others in
standby mode.
HOW COMPONENTS ARE RUN
The Control Plane components, as well as kube-proxy, can either be deployed on the
system directly or they can run as pods (as shown in listing 11.1). You may be surprised
to hear this, but it will all make sense later when we talk about the Kubelet.
The Kubelet is the only component that always runs as a regular system compo-
nent, and it’s the Kubelet that then runs all the other components as pods. To run the
Control Plane components as pods, the Kubelet is also deployed on the master. The
next listing shows pods in the kube-system namespace in a cluster created with
kubeadm, which is explained in appendix B.
As you can see in the listing, all the Control Plane components are running as pods on
the master node. There are three worker nodes, and each one runs the kube-proxy
and a Flannel pod, which provides the overlay network for the pods (we’ll talk about
Flannel later).
TIP As shown in the listing, you can tell kubectl to display custom columns
with the -o custom-columns option and sort the resource list with --sort-by.
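For example, a command along these lines (a sketch, not a listing from the book) shows every pod in the kube-system namespace together with the node it runs on, sorted by node:

$ kubectl get po -n kube-system -o custom-columns=POD:metadata.name,NODE:spec.nodeName --sort-by spec.nodeName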
Now, let’s look at each of the components up close, starting with the lowest level com-
ponent of the Control Plane—the persistent storage.
Kubernetes stores all of its cluster state and metadata in etcd, which is a fast, distributed, and consistent key-value store. Because it's distributed, you can run more than one etcd instance to provide both high availability and better performance.
The only component that talks to etcd directly is the Kubernetes API server. All
other components read and write data to etcd indirectly through the API server. This
brings a few benefits, among them a more robust optimistic locking system as well as
validation; and, by abstracting away the actual storage mechanism from all the other
components, it’s much simpler to replace it in the future. It’s worth emphasizing that
etcd is the only place Kubernetes stores cluster state and metadata.
$ etcdctl ls /registry
/registry/configmaps
/registry/daemonsets
/registry/deployments
/registry/events
/registry/namespaces
/registry/pods
...
You’ll recognize that these keys correspond to the resource types you learned about in
the previous chapters.
NOTE If you’re using v3 of the etcd API, you can’t use the ls command to see
the contents of a directory. Instead, you can list all keys that start with a given
prefix with etcdctl get /registry --prefix=true.
The following listing shows the contents of the /registry/pods directory.
$ etcdctl ls /registry/pods
/registry/pods/default
/registry/pods/kube-system
As you can infer from the names, these two entries correspond to the default and the
kube-system namespaces, which means pods are stored per namespace. The follow-
ing listing shows the entries in the /registry/pods/default directory.
$ etcdctl ls /registry/pods/default
/registry/pods/default/kubia-159041347-xk0vc
/registry/pods/default/kubia-159041347-wt6ga
/registry/pods/default/kubia-159041347-hp2o5
Each entry corresponds to an individual pod. These aren’t directories, but key-value
entries. The following listing shows what’s stored in one of them.
You’ll recognize that this is nothing other than a pod definition in JSON format. The
API server stores the complete JSON representation of a resource in etcd. Because of
etcd’s hierarchical key space, you can think of all the stored resources as JSON files in
a filesystem. Simple, right?
WARNING Prior to Kubernetes version 1.7, the JSON manifest of a Secret
resource was also stored like this (it wasn’t encrypted). If someone got direct
access to etcd, they knew all your Secrets. From version 1.7, Secrets are
encrypted and thus stored much more securely.
ENSURING THE CONSISTENCY AND VALIDITY OF STORED OBJECTS
Remember Google’s Borg and Omega systems mentioned in chapter 1, which are
what Kubernetes is based on? Like Kubernetes, Omega also uses a centralized store to
hold the state of the cluster, but in contrast, multiple Control Plane components
access the store directly. All these components need to make sure they all adhere to
the same optimistic locking mechanism to handle conflicts properly. A single compo-
nent not adhering fully to the mechanism may lead to inconsistent data.
Kubernetes improves this by requiring all other Control Plane components to go
through the API server. This way updates to the cluster state are always consistent, because
the optimistic locking mechanism is implemented in a single place, so less chance exists,
if any, of error. The API server also makes sure that the data written to the store is always
valid and that changes to the data are only performed by authorized clients.
ENSURING CONSISTENCY WHEN ETCD IS CLUSTERED
For ensuring high availability, you’ll usually run more than a single instance of etcd.
Multiple etcd instances will need to remain consistent. Such a distributed system
needs to reach a consensus on what the actual state is. etcd uses the RAFT consensus
algorithm to achieve this, which ensures that at any given moment, each node’s state is
either what the majority of the nodes agrees is the current state or is one of the previ-
ously agreed upon states.
Clients connecting to different nodes of an etcd cluster will either see the actual
current state or one of the states from the past (in Kubernetes, the only etcd client is
the API server, but there may be multiple instances).
The consensus algorithm requires a majority (or quorum) for the cluster to progress
to the next state. As a result, if the cluster splits into two disconnected groups of nodes,
the state in the two groups can never diverge, because to transition from the previous
state to the new one, there needs to be more than half of the nodes taking part in
the state change. If one group contains the majority of all nodes, the other one obvi-
ously doesn’t. The first group can modify the cluster state, whereas the other one can’t.
When the two groups reconnect, the second group can catch up with the state in the
first group (see figure 11.2).
Figure 11.2 In a split-brain scenario, only the side which still has the majority (quorum) accepts
state changes.
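To put numbers on it: in a three-node etcd cluster the quorum is two nodes, and in a five-node cluster it's three, so a disconnected minority of one or two nodes, respectively, can never accept state changes on its own. This is also why etcd is usually deployed with an odd number of instances.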
NOTE When the request is only trying to read data, the request doesn’t go
through the Admission Control.
You’ll find a list of additional Admission Control plugins in the Kubernetes documen-
tation at https://kubernetes.io/docs/admin/admission-controllers/.
Figure 11.4 When an object is updated, the API server sends the updated object to all interested watchers.
One of the API server’s clients is the kubectl tool, which also supports watching
resources. For example, when deploying a pod, you don’t need to constantly poll the list
of pods by repeatedly executing kubectl get pods. Instead, you can use the --watch
flag and be notified of each creation, modification, or deletion of a pod, as shown in
the following listing:
$ kubectl get pods --watch
You can even have kubectl print out the whole YAML of the changed object on each watch event like this:
$ kubectl get pods -o yaml --watch
The watch mechanism is also used by the Scheduler, which is the next Control Plane
component you’re going to learn more about.
Figure 11.5 The Scheduler finds acceptable nodes for a pod and then selects the best node
for the pod.
Or is it? If these two nodes are provided by the cloud infrastructure, it may be bet-
ter to schedule the pod to the first node and relinquish the second node back to the
cloud provider to save money.
ADVANCED SCHEDULING OF PODS
Consider another example. Imagine having multiple replicas of a pod. Ideally, you’d
want them spread across as many nodes as possible instead of having them all sched-
uled to a single one. Failure of that node would cause the service backed by those
pods to become unavailable. But if the pods were spread across different nodes, a sin-
gle node failure would barely leave a dent in the service’s capacity.
Pods belonging to the same Service or ReplicaSet are spread across multiple nodes
by default. It’s not guaranteed that this is always the case. But you can force pods to be
spread around the cluster or kept close together by defining pod affinity and anti-
affinity rules, which are explained in chapter 16.
Even these two simple cases show how complex scheduling can be, because it
depends on a multitude of factors. Because of this, the Scheduler can either be config-
ured to suit your specific needs or infrastructure specifics, or it can even be replaced
with a custom implementation altogether. You could also run a Kubernetes cluster
without a Scheduler, but then you’d have to perform the scheduling manually.
USING MULTIPLE SCHEDULERS
Instead of running a single Scheduler in the cluster, you can run multiple Schedulers.
Then, for each pod, you specify the Scheduler that should schedule this particular
pod by setting the schedulerName property in the pod spec.
Pods without this property set are scheduled using the default Scheduler, and so
are pods with schedulerName set to default-scheduler. All other pods are ignored by
the default Scheduler, so they need to be scheduled either manually or by another
Scheduler watching for such pods.
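As a minimal sketch (the scheduler name here is hypothetical), assigning a pod to a non-default scheduler takes nothing more than setting that one property in the pod spec:

$ kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: my-custom-scheduler    # handled by your second Scheduler, not the default one
  containers:
  - name: main
    image: busybox
    command: ["sleep", "999999"]
EOF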
You can implement your own Schedulers and deploy them in the cluster, or you
can deploy an additional instance of Kubernetes’ Scheduler with different configura-
tion options.
Controllers running inside the Controller Manager include the following:
■ Deployment controller
■ StatefulSet controller
■ Node controller
■ Service controller
■ Endpoints controller
■ Namespace controller
■ PersistentVolume controller
■ Others
What each of these controllers does should be evident from its name. From the list,
you can tell there’s a controller for almost every resource you can create. Resources
are descriptions of what should be running in the cluster, whereas the controllers are
the active Kubernetes components that perform actual work as a result of the deployed
resources.
UNDERSTANDING WHAT CONTROLLERS DO AND HOW THEY DO IT
Controllers do many different things, but they all watch the API server for changes to
resources (Deployments, Services, and so on) and perform operations for each change,
whether it’s a creation of a new object or an update or deletion of an existing object.
Most of the time, these operations include creating other resources or updating the
watched resources themselves (to update the object’s status, for example).
In general, controllers run a reconciliation loop, which reconciles the actual state
with the desired state (specified in the resource’s spec section) and writes the new
actual state to the resource’s status section. Controllers use the watch mechanism to
be notified of changes, but because using watches doesn’t guarantee the controller
won’t miss an event, they also perform a re-list operation periodically to make sure
they haven’t missed anything.
Controllers never talk to each other directly. They don’t even know any other con-
trollers exist. Each controller connects to the API server and, through the watch
mechanism described in section 11.1.3, asks to be notified when a change occurs in
the list of resources of any type the controller is responsible for.
We’ll briefly look at what each of the controllers does, but if you’d like an in-depth
view of what they do, I suggest you look at their source code directly. The sidebar
explains how to get started.
Each controller typically creates one or more Informers in its constructor; an Informer listens for changes to a specific type of resource. Looking at the constructor will show you which resources the controller is watching.
Next, go look for the worker() method. In it, you’ll find the method that gets invoked
each time the controller needs to do something. The actual function is often stored
in a field called syncHandler or something similar. This field is also initialized in the
constructor, so that’s where you’ll find the name of the function that gets called. That
function is the place where all the magic happens.
Figure 11.6 (diagram): the Replication Manager watches ReplicationController and Pod resources (among others) and creates and deletes pods.
When the actual number of pods doesn't match the desired replica count, the Replication Manager doesn't run the missing pods itself. It creates new Pod manifests, posts them to the API server, and lets the Scheduler and the Kubelet do their job of scheduling and running the pod.
The Replication Manager performs its work by manipulating Pod API objects
through the API server. This is how all controllers operate.
THE REPLICASET, THE DAEMONSET, AND THE JOB CONTROLLERS
The ReplicaSet controller does almost the same thing as the Replication Manager
described previously, so we don’t have much to add here. The DaemonSet and Job
controllers are similar. They create Pod resources from the pod template defined in
their respective resources. Like the Replication Manager, these controllers don’t run
the pods, but post Pod definitions to the API server, letting the Kubelet create their
containers and run them.
THE DEPLOYMENT CONTROLLER
The Deployment controller takes care of keeping the actual state of a deployment in
sync with the desired state specified in the corresponding Deployment API object.
The Deployment controller performs a rollout of a new version each time a
Deployment object is modified (if the modification should affect the deployed pods).
It does this by creating a ReplicaSet and then appropriately scaling both the old and
the new ReplicaSet based on the strategy specified in the Deployment, until all the old
pods have been replaced with new ones. It doesn’t create any pods directly.
THE STATEFULSET CONTROLLER
The StatefulSet controller, similarly to the ReplicaSet controller and other related
controllers, creates, manages, and deletes Pods according to the spec of a StatefulSet
resource. But while those other controllers only manage Pods, the StatefulSet control-
ler also instantiates and manages PersistentVolumeClaims for each Pod instance.
THE NODE CONTROLLER
The Node controller manages the Node resources, which describe the cluster’s worker
nodes. Among other things, a Node controller keeps the list of Node objects in sync
with the actual list of machines running in the cluster. It also monitors each node’s
health and evicts pods from unreachable nodes.
The Node controller isn’t the only component making changes to Node objects.
They’re also changed by the Kubelet, and can obviously also be modified by users
through REST API calls.
THE SERVICE CONTROLLER
In chapter 5, when we talked about Services, you learned that a few different types
exist. One of them was the LoadBalancer service, which requests a load balancer from
the infrastructure to make the service available externally. The Service controller is
the one requesting and releasing a load balancer from the infrastructure, when a
LoadBalancer-type Service is created or deleted.
Figure 11.7 The Endpoints controller watches Service and Pod resources, and manages Endpoints.
THE PERSISTENTVOLUME CONTROLLER
When a PersistentVolumeClaim is created, the PersistentVolume controller finds the best matching PersistentVolume for the access mode requested in the claim. It does this by keeping an ordered list of PersistentVolumes for each access mode, sorted by ascending capacity, and returning the first volume from the list that satisfies the claim. Then, when the user deletes the PersistentVolumeClaim, the volume is unbound and reclaimed according to the volume's reclaim policy (left as is, deleted, or emptied).
CONTROLLER WRAP-UP
You should now have a good feel for what each controller does and how controllers
work in general. Again, all these controllers operate on the API objects through the
API server. They don’t communicate with the Kubelets directly or issue any kind of
instructions to them. In fact, they don’t even know Kubelets exist. After a controller
updates a resource in the API server, the Kubelets and Kubernetes Service Proxies,
also oblivious of the controllers’ existence, perform their work, such as spinning up a
pod’s containers and attaching network storage to them, or in the case of services, set-
ting up the actual load balancing across pods.
The Control Plane handles one part of the operation of the whole system, so to
fully understand how things unfold in a Kubernetes cluster, you also need to under-
stand what the Kubelet and the Kubernetes Service Proxy do. We’ll learn that next.
Figure 11.8 The Kubelet runs pods based on pod specs from the API server and a local file directory.
You can also use the same method to run your own custom system containers, but doing it through a DaemonSet is the recommended method.
Figure 11.9 (diagram): the userspace proxy mode, in which kube-proxy configures iptables to redirect connections through an actual proxy server.
The kube-proxy got its name because it was an actual proxy, but the current, much
better performing implementation only uses iptables rules to redirect packets to a
randomly selected backend pod without passing them through an actual proxy server.
This mode is called the iptables proxy mode and is shown in figure 11.10.
Figure 11.10 (diagram): the iptables proxy mode, in which iptables rules redirect packets straight to a randomly selected backend pod, with no proxy server in between.
The major difference between these two modes is whether packets pass through the
kube-proxy and must be handled in user space, or whether they’re handled only by
the Kernel (in kernel space). This has a major impact on performance.
Another smaller difference is that the userspace proxy mode balanced connec-
tions across pods in a true round-robin fashion, while the iptables proxy mode
doesn’t—it selects pods randomly. When only a few clients use a service, they may not
be spread evenly across pods. For example, if a service has two backing pods but only
five or so clients, don’t be surprised if you see four clients connect to pod A and only
one client connect to pod B. With a higher number of clients or pods, this problem
isn’t so apparent.
You’ll learn exactly how iptables proxy mode works in section 11.5.
Other add-ons are similar. They all need to observe the cluster state and perform
the necessary actions when that changes. We’ll introduce a few other add-ons in this
and the remaining chapters.
Figure 11.11 Kubernetes components watching API objects through the API server
Figure 11.12 The chain of events that unfolds when a Deployment resource is posted to the API server
As you can see, the SOURCE column shows the controller performing the action, and
the NAME and KIND columns show the resource the controller is acting on. The REASON
column and the MESSAGE column (shown in every second line) give more details
about what the controller has done.
You can now ssh into the worker node running the pod and inspect the list of run-
ning Docker containers. I’m using Minikube to test this out, so to ssh into the single
node, I use minikube ssh. If you’re using GKE, you can ssh into a node with gcloud
compute ssh <node name>.
Once you’re inside the node, you can list all the running containers with docker
ps, as shown in the following listing.
docker@minikubeVM:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED
c917a6f3c3f7 nginx "nginx -g 'daemon off" 4 seconds ago
98b8bf797174 gcr.io/.../pause:3.0 "/pause" 7 seconds ago
...
As expected, you see the Nginx container, but also an additional container. Judging
from the COMMAND column, this additional container isn’t doing anything (the con-
tainer’s command is "pause"). If you look closely, you’ll see that this container was
created a few seconds before the Nginx container. What’s its role?
This pause container is the container that holds all the containers of a pod
together. Remember how all containers of a pod share the same network and other
Linux namespaces? The pause container is an infrastructure container whose sole
purpose is to hold all these namespaces. All other user-defined containers of the pod
then use the namespaces of the pod infrastructure container (see figure 11.13).
Figure 11.13 Both of the pod's containers use the Linux namespaces of the pod infrastructure container.
Actual application containers may die and get restarted. When such a container starts
up again, it needs to become part of the same Linux namespaces as before. The infra-
structure container makes this possible since its lifecycle is tied to that of the pod—the
container runs from the time the pod is scheduled until the pod is deleted. If the
infrastructure container is killed in the meantime, the Kubelet recreates it and all
the pod's containers.
Figure 11.14 Pods must communicate without NAT: a packet sent from pod A on node 1 (IP 10.1.1.1) to pod B on node 2 (IP 10.1.2.1) keeps both its source and destination IPs unchanged.
This is important, because it makes networking for applications running inside pods
simple and exactly as if they were running on machines connected to the same net-
work switch. The absence of NAT between pods enables applications running inside
them to self-register in other pods.
For example, say you have a client pod X and pod Y, which provides a kind of noti-
fication service to all pods that register with it. Pod X connects to pod Y and tells it,
“Hey, I’m pod X, available at IP 1.2.3.4; please send updates to me at this IP address.”
The pod providing the service can connect to the first pod by using the received
IP address.
The requirement for NAT-less communication between pods also extends to pod-
to-node and node-to-pod communication. But when a pod communicates with ser-
vices out on the internet, the source IP of the packets the pod sends does need to be
changed, because the pod’s IP is private. The source IP of outbound packets is
changed to the host worker node’s IP address.
Building a proper Kubernetes cluster involves setting up the networking according
to these requirements. There are various methods and technologies available to do
this, each with its own benefits or drawbacks in a given scenario. Because of this, we’re
not going to go into specific technologies. Instead, let’s explain how inter-pod net-
working works in general.
Figure 11.15 Pods on a node are connected to the same bridge through virtual Ethernet interface pairs.
The interface in the host’s network namespace is attached to a network bridge that
the container runtime is configured to use. The eth0 interface in the container is
assigned an IP address from the bridge’s address range. Anything that an application
running inside the container sends to the eth0 network interface (the one in the con-
tainer’s namespace), comes out at the other veth interface in the host’s namespace
and is sent to the bridge. This means it can be received by any network interface that’s
connected to the bridge.
If pod A sends a network packet to pod B, the packet first goes through pod A’s
veth pair to the bridge and then through pod B’s veth pair. All containers on a node
are connected to the same bridge, which means they can all communicate with each
other. But to enable communication between containers running on different nodes,
the bridges on those nodes need to be connected somehow.
ENABLING COMMUNICATION BETWEEN PODS ON DIFFERENT NODES
You have many ways to connect bridges on different nodes. This can be done with
overlay or underlay networks or by regular layer 3 routing, which we’ll look at next.
You know pod IP addresses must be unique across the whole cluster, so the bridges
across the nodes must use non-overlapping address ranges to prevent pods on differ-
ent nodes from getting the same IP. In the example shown in figure 11.16, the bridge
on node A is using the 10.1.1.0/24 IP range and the bridge on node B is using
10.1.2.0/24, which ensures no IP address conflicts exist.
Figure 11.16 shows that to enable communication between pods across two nodes
with plain layer 3 networking, the node’s physical network interface needs to be con-
nected to the bridge as well. Routing tables on node A need to be configured so all
packets destined for 10.1.2.0/24 are routed to node B, whereas node B’s routing
tables need to be configured so packets sent to 10.1.1.0/24 are routed to node A.
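To make this concrete, here's a minimal sketch of such a routing setup, configured by hand on each node (the node IP addresses 192.168.0.1 and 192.168.0.2 are illustrative assumptions, not values from this example):

# On node A (192.168.0.1): route node B's pod subnet through node B
$ sudo ip route add 10.1.2.0/24 via 192.168.0.2

# On node B (192.168.0.2): route node A's pod subnet through node A
$ sudo ip route add 10.1.1.0/24 via 192.168.0.1

In practice, a network plugin or a cloud provider's route controller creates equivalent routes for you.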
With this type of setup, when a packet is sent by a container on one of the nodes
to a container on the other node, the packet first goes through the veth pair, then
Figure 11.16 For pods on different nodes to communicate, the bridges need to be connected somehow.
through the bridge to the node’s physical adapter, then over the wire to the other
node’s physical adapter, through the other node’s bridge, and finally through the veth
pair of the destination container.
This works only when nodes are connected to the same network switch, without
any routers in between; otherwise those routers would drop the packets because
they refer to pod IPs, which are private. Sure, the routers in between could be con-
figured to route packets between the nodes, but this becomes increasingly difficult
and error-prone as the number of routers between the nodes increases. Because of
this, it’s easier to use a Software Defined Network (SDN), which makes the nodes
appear as though they’re connected to the same network switch, regardless of the
actual underlying network topology, no matter how complex it is. Packets sent
from the pod are encapsulated and sent over the network to the node running the
other pod, where they are de-encapsulated and delivered to the pod in their origi-
nal form.
We’re not going to go into the details of these plugins; if you want to learn more about
them, refer to https://kubernetes.io/docs/concepts/cluster-administration/addons/.
Installing a network plugin isn’t difficult. You only need to deploy a YAML con-
taining a DaemonSet and a few other supporting resources. This YAML is provided
on each plugin’s project page. As you can imagine, the DaemonSet is used to deploy
a network agent on all cluster nodes. It then ties into the CNI interface on the node,
but be aware that the Kubelet needs to be started with --network-plugin=cni to
use CNI.
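Installing such a plugin usually comes down to a single command like the following (the URL is a placeholder, not a real manifest; each plugin's project page gives the actual one):

$ kubectl apply -f https://example.com/network-plugin-daemonset.yaml

The manifest typically creates the DaemonSet along with the ServiceAccount, RBAC rules, and ConfigMaps the network agent needs.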
Figure 11.17 Network packets sent to a Service's virtual IP/port pair are modified and redirected to a randomly selected backend pod.
packet is first handled by node A’s kernel according to the iptables rules set up on
the node.
The kernel checks if the packet matches any of those iptables rules. One of them
says that if any packet has the destination IP equal to 172.30.0.1 and destination port
equal to 80, the packet’s destination IP and port should be replaced with the IP and
port of a randomly selected pod.
The packet in the example matches that rule and so its destination IP/port is
changed. In the example, pod B2 was randomly selected, so the packet’s destination
IP is changed to 10.1.2.1 (pod B2’s IP) and the port to 8080 (the target port specified
in the Service spec). From here on, it’s exactly as if the client pod had sent the packet
to pod B directly instead of through the service.
It’s slightly more complicated than that, but that’s the most important part you
need to understand.
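Conceptually, the rule the kernel applies is equivalent to something like the following iptables command (a heavily simplified sketch; the real kube-proxy builds chains of rules with probability-based jumps rather than a single DNAT rule):

$ iptables -t nat -A PREROUTING -d 172.30.0.1/32 -p tcp --dport 80 \
    -j DNAT --to-destination 10.1.2.1:8080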
Without going into the actual details of how to install and run these components, let’s
see what’s involved in making each of these components highly available. Figure 11.18
shows an overview of a highly available cluster.
Running three, five, or seven etcd instances lets the cluster tolerate one, two, or three
node failures, respectively. Having more than seven etcd instances is almost never nec-
essary and begins impacting performance.
RUNNING MULTIPLE INSTANCES OF THE API SERVER
Making the API server highly available is even simpler. Because the API server is (almost
completely) stateless (all the data is stored in etcd, but the API server does cache it), you
can run as many API servers as you need, and they don’t need to be aware of each other
at all. Usually, one API server is collocated with every etcd instance. By doing this, the
etcd instances don’t need any kind of load balancer in front of them, because every API
server instance only talks to the local etcd instance.
The API servers, on the other hand, do need to be fronted by a load balancer, so
clients (kubectl, but also the Controller Manager, Scheduler, and all the Kubelets)
always connect only to the healthy API server instances.
ENSURING HIGH AVAILABILITY OF THE CONTROLLERS AND THE SCHEDULER
Compared to the API server, where multiple replicas can run simultaneously, run-
ning multiple instances of the Controller Manager or the Scheduler isn’t as simple.
Because controllers and the Scheduler all actively watch the cluster state and act when
it changes, possibly modifying the cluster state further (for example, when the desired
replica count on a ReplicaSet is increased by one, the ReplicaSet controller creates an
additional pod), running multiple instances of each of those components would
result in all of them performing the same action. They’d be racing each other, which
could cause undesired effects (creating two new pods instead of one, as mentioned in
the previous example).
For this reason, when running multiple instances of these components, only one
instance may be active at any given time. Luckily, this is all taken care of by the compo-
nents themselves (this is controlled with the --leader-elect option, which defaults to
true). Each individual component will only be active when it’s the elected leader. Only
the leader performs actual work, whereas all other instances are standing by and waiting
for the current leader to fail. When it does, the remaining instances elect a new leader,
which then takes over the work. This mechanism ensures that two components are never
operating at the same time and doing the same work (see figure 11.19).
Figure 11.19 Only a single Controller Manager and a single Scheduler are active; others are standing by.
The Controller Manager and Scheduler can run collocated with the API server and
etcd, or they can run on separate machines. When collocated, they can talk to the
local API server directly; otherwise they connect to the API servers through the load
balancer.
UNDERSTANDING THE LEADER ELECTION MECHANISM USED IN CONTROL PLANE COMPONENTS
What I find most interesting here is that these components don’t need to talk to each
other directly to elect a leader. The leader election mechanism works purely by creat-
ing a resource in the API server. And it’s not even a special kind of resource—the End-
points resource is used to achieve this (abused is probably a more appropriate term).
There’s nothing special about using an Endpoints object to do this. It’s used
because it has no side effects as long as no Service with the same name exists. Any
other resource could be used (in fact, the leader election mechanism will soon use
ConfigMaps instead of Endpoints).
I’m sure you’re interested in how a resource can be used for this purpose. Let’s
take the Scheduler, for example. All instances of the Scheduler try to create (and later
update) an Endpoints resource called kube-scheduler. You’ll find it in the kube-
system namespace, as the following listing shows.
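You can inspect the resource yourself; the interesting part is an annotation holding the leader-election record. The output looks roughly like this (the field values shown are illustrative and will differ on your cluster):

$ kubectl get endpoints kube-scheduler -n kube-system -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"minikube",
      "leaseDurationSeconds":15,"renewTime":"2017-05-28T13:07:49Z", ...}'
  name: kube-scheduler
  namespace: kube-system

The holderIdentity field names the instance that currently holds leadership, and renewTime records when the record was last refreshed.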
The current leader keeps updating the resource regularly; if it stops doing so, the
other instances see that the resource hasn't been updated for a while and try to become
the leader by writing their own name to the resource. Simple, right?
11.7 Summary
Hopefully, this has been an interesting chapter that has improved your knowledge of
the inner workings of Kubernetes. This chapter has shown you
What components make up a Kubernetes cluster and what each component is
responsible for
How the API server, Scheduler, various controllers running in the Controller
Manager, and the Kubelet work together to bring a pod to life
How the infrastructure container binds together all the containers of a pod
How pods communicate with other pods running on the same node through
the network bridge, and how those bridges on different nodes are connected,
so pods running on different nodes can talk to each other
How the kube-proxy performs load balancing across pods in the same service by
configuring iptables rules on the node
How multiple instances of each component of the Control Plane can be run to
make the cluster highly available
Next, we’ll look at how to secure the API server and, by extension, the cluster as a whole.
Securing the
Kubernetes API server
In chapter 8 you learned how applications running in pods can talk to the API
server to retrieve or change the state of resources deployed in the cluster. To
authenticate with the API server, you used the ServiceAccount token mounted into
the pod. In this chapter, you’ll learn what ServiceAccounts are and how to config-
ure their permissions, as well as permissions for other subjects using the cluster.
When a request is received, the API server passes it through the list of configured
authentication plugins, so they can each examine the request and try to determine who's sending the
request. The first plugin that can extract that information from the request returns
the username, user ID, and the groups the client belongs to back to the API server
core. The API server stops invoking the remaining authentication plugins and contin-
ues onto the authorization phase.
Several authentication plugins are available. They obtain the identity of the client
using the following methods:
From the client certificate
From an authentication token passed in an HTTP header
Basic HTTP authentication
Others
The authentication plugins are enabled through command-line options when starting
the API server.
Both these types of clients are authenticated using the aforementioned authentication
plugins. Users are meant to be managed by an external system, such as a Single Sign
On (SSO) system, but the pods use a mechanism called service accounts, which are cre-
ated and stored in the cluster as ServiceAccount resources. In contrast, no resource
represents user accounts, which means you can’t create, update, or delete users through
the API server.
We won’t go into any details of how to manage users, but we will explore Service-
Accounts in detail, because they’re essential for running pods. For more informa-
tion on how to configure the cluster for authentication of human users, cluster
administrators should refer to the Kubernetes Cluster Administrator guide at http://
kubernetes.io/docs/admin.
UNDERSTANDING GROUPS
Both human users and ServiceAccounts can belong to one or more groups. We’ve said
that the authentication plugin returns groups along with the username and user ID.
Groups are used to grant permissions to several users at once, instead of having to
grant them to individual users.
Groups returned by the plugin are nothing but strings, representing arbitrary
group names, but built-in groups have special meaning:
The system:unauthenticated group is used for requests where none of the
authentication plugins could authenticate the client.
The system:authenticated group is automatically assigned to a user who was
authenticated successfully.
The system:serviceaccounts group encompasses all ServiceAccounts in the
system.
The system:serviceaccounts:<namespace> group includes all ServiceAccounts in a
specific namespace.
The API server passes this username to the configured authorization plugins, which
determine whether the action the app is trying to perform is allowed to be performed
by the ServiceAccount.
ServiceAccounts are nothing more than a way for an application running inside a
pod to authenticate itself with the API server. As already mentioned, applications do
that by passing the ServiceAccount’s token in the request.
UNDERSTANDING THE SERVICEACCOUNT RESOURCE
ServiceAccounts are resources just like Pods, Secrets, ConfigMaps, and so on, and are
scoped to individual namespaces. A default ServiceAccount is automatically created
for each namespace (that’s the one your pods have used all along).
You can list ServiceAccounts like you do other resources:
$ kubectl get sa
NAME SECRETS AGE
default 1 1d
As you can see, the current namespace only contains the default ServiceAccount. Addi-
tional ServiceAccounts can be added when required. Each pod is associated with exactly
one ServiceAccount, but multiple pods can use the same ServiceAccount. As you can
see in figure 12.1, a pod can only use a ServiceAccount from the same namespace.
Figure 12.1 Each pod is associated with a single ServiceAccount in the pod’s namespace.
Let’s see how you can create additional ServiceAccounts, how they relate to Secrets,
and how you can assign them to your pods.
CREATING A SERVICEACCOUNT
Creating a ServiceAccount is incredibly easy, thanks to the dedicated kubectl create
serviceaccount command. Let’s create a new ServiceAccount called foo:
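$ kubectl create serviceaccount foo
serviceaccount "foo" created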
Now, you can inspect the ServiceAccount with the describe command, as shown in
the following listing.
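The output looks roughly like this (the token Secret's name is randomly generated, so yours will differ):

$ kubectl describe sa foo
Name:               foo
Namespace:          default
Labels:             <none>
Image pull secrets: <none>
Mountable secrets:  foo-token-qzq7j
Tokens:             foo-token-qzq7j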
You can see that a custom token Secret has been created and associated with the
ServiceAccount. If you look at the Secret’s data with kubectl describe secret foo-
token-qzq7j, you’ll see it contains the same items (the CA certificate, namespace, and
token) as the default ServiceAccount’s token does (the token itself will obviously be
different), as shown in the following listing.
NOTE You’ve probably heard of JSON Web Tokens (JWT). The authentica-
tion tokens used in ServiceAccounts are JWT tokens.
UNDERSTANDING A SERVICEACCOUNT’S MOUNTABLE SECRETS
The token is shown in the Mountable secrets list when you inspect a ServiceAccount
with kubectl describe. Let me explain what that list represents. In chapter 7 you
learned how to create Secrets and mount them inside a pod. By default, a pod can
mount any Secret it wants. But the pod’s ServiceAccount can be configured to only
allow the pod to mount Secrets that are listed as mountable Secrets on the Service-
Account. To enable this feature, the ServiceAccount must contain the following anno-
tation: kubernetes.io/enforce-mountable-secrets="true".
If the ServiceAccount is annotated with this annotation, any pods using it can mount
only the ServiceAccount’s mountable Secrets—they can’t use any other Secret.
UNDERSTANDING A SERVICEACCOUNT’S IMAGE PULL SECRETS
A ServiceAccount can also contain a list of image pull Secrets, which we examined in
chapter 7. In case you don’t remember, they are Secrets that hold the credentials for
pulling container images from a private image repository.
The following listing shows an example of a ServiceAccount definition, which
includes the image pull Secret you created in chapter 7.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-service-account
imagePullSecrets:
- name: my-dockerhub-secret
A ServiceAccount’s image pull Secrets behave slightly differently than its mountable
Secrets. Unlike mountable Secrets, they don’t determine which image pull Secrets a
pod can use, but which ones are added automatically to all pods using the Service-
Account. Adding image pull Secrets to a ServiceAccount saves you from having to add
them to each pod individually.
NOTE A pod’s ServiceAccount must be set when creating the pod. It can’t be
changed later.
apiVersion: v1
kind: Pod
metadata:
  name: curl-custom-sa
spec:
  serviceAccountName: foo          # This pod uses the foo ServiceAccount
  containers:                      # instead of the default.
  - name: main
    image: tutum/curl
    command: ["sleep", "9999999"]
  - name: ambassador
    image: luksa/kubectl-proxy:1.6.2
To confirm that the custom ServiceAccount’s token is mounted into the two contain-
ers, you can print the contents of the token as shown in the following listing.
Listing 12.5 Inspecting the token mounted into the pod’s container(s)
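A sketch of one way to do this (the token is mounted at the standard ServiceAccount path; the token value shown is truncated and illustrative):

$ kubectl exec -it curl-custom-sa -c main
➥ cat /var/run/secrets/kubernetes.io/serviceaccount/token
eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...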
You can see the token is the one from the foo ServiceAccount by comparing the token
string in listing 12.5 with the one in listing 12.2.
USING THE CUSTOM SERVICEACCOUNT’S TOKEN TO TALK TO THE API SERVER
Let’s see if you can talk to the API server using this token. As mentioned previously,
the ambassador container uses the token when talking to the server, so you can test
the token by going through the ambassador, which listens on localhost:8001, as
shown in the following listing.
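A sketch of the request and its (abbreviated) response:

$ kubectl exec -it curl-custom-sa -c main curl localhost:8001/api/v1/pods
{
  "kind": "PodList",
  "apiVersion": "v1",
  ...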
Okay, you got back a proper response from the server, which means the custom
ServiceAccount is allowed to list pods. This may be because your cluster doesn’t use
the RBAC authorization plugin, or you gave all ServiceAccounts full permissions, as
instructed in chapter 8.
When your cluster isn’t using proper authorization, creating and using additional
ServiceAccounts doesn’t make much sense, since even the default ServiceAccount is
allowed to do anything. The only reason to use ServiceAccounts in that case is to
enforce mountable Secrets or to provide image pull Secrets through the Service-
Account, as explained earlier.
But creating additional ServiceAccounts is practically a must when you use the
RBAC authorization plugin, which we’ll explore next.
Get Pods
Create Services
Update Secrets
And so on
The verbs in those examples (get, create, update) map to HTTP methods (GET, POST,
PUT) performed by the client (the complete mapping is shown in table 12.1). The
nouns (Pods, Service, Secrets) obviously map to Kubernetes resources.
An authorization plugin such as RBAC, which runs inside the API server, deter-
mines whether a client is allowed to perform the requested verb on the requested
resource or not.
Table 12.1 (excerpt)
HTTP method    Verb for a single resource       Verb for a collection
GET, HEAD      get (and watch for watching)     list (and watch)
NOTE The additional verb use is used for PodSecurityPolicy resources, which
are explained in the next chapter.
Besides applying security permissions to whole resource types, RBAC rules can also
apply to specific instances of a resource (for example, a Service called myservice).
And later you’ll see that permissions can also apply to non-resource URL paths,
because not every path the API server exposes maps to a resource (such as the /api
path itself or the server health information at /healthz).
UNDERSTANDING THE RBAC PLUGIN
The RBAC authorization plugin, as the name suggests, uses user roles as the key factor
in determining whether the user may perform the action or not. A subject (which may
be a human, a ServiceAccount, or a group of users or ServiceAccounts) is associated
with one or more roles and each role is allowed to perform certain verbs on certain
resources.
If a user has multiple roles, they may do anything that any of their roles allows
them to do. If none of the user’s roles contains a permission to, for example, update
Secrets, the API server will prevent the user from performing PUT or PATCH requests
on Secrets.
Managing authorization through the RBAC plugin is simple. It’s all done by creat-
ing four RBAC-specific Kubernetes resources, which we’ll look at next.
Roles define what can be done, while bindings define who can do it (this is shown in
figure 12.2).
Figure 12.2 Roles grant permissions, whereas RoleBindings bind Roles to subjects.
Figure 12.3 Roles and RoleBindings are namespaced; ClusterRoles and ClusterRoleBindings aren't.
NOTE If you’re using GKE 1.6 or 1.7, you need to explicitly disable legacy autho-
rization by creating the cluster with the --no-enable-legacy-authorization
option. If you’re using Minikube, you also may need to enable RBAC by start-
ing Minikube with --extra-config=apiserver.Authorization.Mode=RBAC.
If you followed the instructions on how to disable RBAC in chapter 8, now’s the time
to re-enable it by running the following command:
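Assuming you disabled RBAC by creating the permissive ClusterRoleBinding described in chapter 8, deleting that binding re-enables it (the binding name permissive-binding is the one assumed to have been used there):

$ kubectl delete clusterrolebinding permissive-binding
clusterrolebinding "permissive-binding" deleted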
To try out RBAC, you’ll run a pod through which you’ll try to talk to the API server,
the way you did in chapter 8. But this time you’ll run two pods in different namespaces
to see how per-namespace security behaves.
In the examples in chapter 8, you ran two containers to demonstrate how an appli-
cation in one container uses the other container to talk to the API server. This time,
you’ll run a single container (based on the kubectl-proxy image) and use kubectl
exec to run curl inside that container directly. The proxy will take care of authentica-
tion and HTTPS, so you can focus on the authorization aspect of API server security.
Now open two terminals and use kubectl exec to run a shell inside each of the two
pods (one in each terminal). For example, to run the shell in the pod in namespace
foo, first get the name of the pod:
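$ kubectl get po -n foo
NAME                   READY     STATUS    RESTARTS   AGE
test-145485760-ttq36   1/1       Running   0          1m

The pod name shown here is only illustrative; use the one from your own output to start a shell inside the pod:

$ kubectl exec -it test-145485760-ttq36 -n foo sh
/ #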
Do the same in the other terminal, but for the pod in the bar namespace.
LISTING SERVICES FROM YOUR PODS
To verify that RBAC is enabled and preventing the pod from reading cluster state, use
curl to list Services in the foo namespace:
/ # curl localhost:8001/api/v1/namespaces/foo/services
User "system:serviceaccount:foo:default" cannot list services in the
namespace "foo".
Figure 12.4 The service-reader Role allows getting and listing Services.
CREATING A ROLE
Create the previous Role in the foo namespace now:
$ kubectl create -f service-reader.yaml -n foo
role "service-reader" created
Note that if you’re using GKE, the previous command may fail because you don’t have
cluster-admin rights. To grant the rights, run the following command:
$ kubectl create clusterrolebinding cluster-admin-binding
➥ --clusterrole=cluster-admin [email protected]
Instead of creating the service-reader Role from a YAML file, you could also create
it with the special kubectl create role command. Let’s use this method to create the
Role in the bar namespace:
$ kubectl create role service-reader --verb=get --verb=list
➥ --resource=services -n bar
role "service-reader" created
These two Roles will allow you to list Services in the foo and bar namespaces from
within your two pods (running in the foo and bar namespace, respectively). But cre-
ating the two Roles isn’t enough (you can check by executing the curl command
again). You need to bind each of the Roles to the ServiceAccounts in their respec-
tive namespaces.
BINDING A ROLE TO A SERVICEACCOUNT
A Role defines what actions can be performed, but it doesn’t specify who can perform
them. To do that, you must bind the Role to a subject, which can be a user, a Service-
Account, or a group (of users or ServiceAccounts).
Binding Roles to subjects is achieved by creating a RoleBinding resource. To bind
the Role to the default ServiceAccount, run the following command:
$ kubectl create rolebinding test --role=service-reader
➥ --serviceaccount=foo:default -n foo
rolebinding "test" created
Figure 12.5 The test RoleBinding binds the default ServiceAccount with the service-reader Role.
The following listing shows the YAML of the RoleBinding you created.
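Stripped of server-generated metadata, it looks roughly like this (a sketch of what kubectl get rolebinding test -n foo -o yaml returns):

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: test
  namespace: foo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role                   # This RoleBinding references
  name: service-reader         # the service-reader Role
subjects:
- kind: ServiceAccount         # and binds it to the default ServiceAccount
  name: default                # in the foo namespace.
  namespace: foo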
As you can see, a RoleBinding always references a single Role (as evident from the
roleRef property), but can bind the Role to multiple subjects (for example, one or
more ServiceAccounts and any number of users or groups). Because this RoleBinding
binds the Role to the ServiceAccount the pod in namespace foo is running under, you
can now list Services from within that pod.
/ # curl localhost:8001/api/v1/namespaces/foo/services
{
"kind": "ServiceList",
"apiVersion": "v1",
"metadata": {
"selfLink": "/api/v1/namespaces/foo/services",
"resourceVersion": "24906"
},
"items": []
The list of items is empty,
} because no Services exist.
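You can also bind the same Role to the default ServiceAccount in the bar namespace by adding it as another subject of the existing RoleBinding. To do that, open the RoleBinding for editing:

$ kubectl edit rolebinding test -n foo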
Then add the following lines to the list of subjects, as shown in the following listing.
subjects:
- kind: ServiceAccount
  name: default            # You're referencing the default ServiceAccount
  namespace: bar           # in the bar namespace.
Now you can also list Services in the foo namespace from inside the pod running in
the bar namespace. Run the same command as in listing 12.10, but do it in the other
terminal, where you’re running the shell in the other pod.
Before moving on to ClusterRoles and ClusterRoleBindings, let’s summarize
what RBAC resources you currently have. You have a RoleBinding in namespace
foo, which references the service-reader Role (also in the foo namespace) and
binds the default ServiceAccounts in both the foo and the bar namespaces, as
depicted in figure 12.6.
Figure 12.6 A RoleBinding binding ServiceAccounts from different namespaces to the same Role.
rules:
- apiGroups:
  - ""
  resources:               # In this case, the rules are exactly
  - persistentvolumes      # like those in a regular Role.
  verbs:
  - get
  - list
Before you bind this ClusterRole to your pod’s ServiceAccount, verify whether the pod
can list PersistentVolumes. Run the following command in the first terminal, where
you’re running the shell inside the pod in the foo namespace:
/ # curl localhost:8001/api/v1/persistentvolumes
User "system:serviceaccount:foo:default" cannot list persistentvolumes at the
cluster scope.
Hmm, that’s strange. Let’s examine the RoleBinding’s YAML in the following listing.
Can you tell what (if anything) is wrong with it?
subjects:
- kind: ServiceAccount     # The bound subject is the
  name: default            # default ServiceAccount in
  namespace: foo           # the foo namespace.
The YAML looks perfectly fine. You’re referencing the correct ClusterRole and the
correct ServiceAccount, as shown in figure 12.7, so what’s wrong?
Figure 12.7 Despite the RoleBinding referencing the ClusterRole, the default ServiceAccount is still unable to get and list PersistentVolumes.
Although you can create a RoleBinding and have it reference a ClusterRole when you
want to enable access to namespaced resources, you can’t use the same approach for
cluster-level (non-namespaced) resources. To grant access to cluster-level resources,
you must always use a ClusterRoleBinding.
Luckily, creating a ClusterRoleBinding isn’t that different from creating a Role-
Binding, but you’ll clean up and delete the RoleBinding first:
$ kubectl delete rolebinding pv-test
rolebinding "pv-test" deleted
As you can see, you replaced rolebinding with clusterrolebinding in the command
and didn’t (need to) specify the namespace. Figure 12.8 shows what you have now.
Let’s see if you can list PersistentVolumes now:
/ # curl localhost:8001/api/v1/persistentvolumes
{
"kind": "PersistentVolumeList",
"apiVersion": "v1",
...
Figure 12.8 A ClusterRoleBinding and ClusterRole must be used to grant access to cluster-level resources.
You can! It turns out you must use a ClusterRole and a ClusterRoleBinding when
granting access to cluster-level resources.
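ClusterRoles aren't only about cluster-level resources; they're also used for non-resource URL paths. A good example is the built-in system:discovery ClusterRole. A rough sketch of what it contains (inspect it with kubectl get clusterrole system:discovery -o yaml; the exact list of URLs depends on the Kubernetes version):

rules:
- nonResourceURLs:
  - /api
  - /api/*
  - /apis
  - /apis/*
  - /healthz
  - /version
  verbs:
  - get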
You can see this ClusterRole refers to URLs instead of resources (field nonResource-
URLs is used instead of the resources field). The verbs field only allows the GET HTTP
method to be used on these URLs.
NOTE For non-resource URLs, plain HTTP verbs such as post, put, and
patch are used instead of create or update. The verbs need to be specified in
lowercase.
You can confirm this by accessing the /api URL path from inside the pod (through
the kubectl proxy, which means you’ll be authenticated as the pod’s ServiceAccount)
and from your local machine, without specifying any authentication tokens (making
you an unauthenticated user):
$ curl https://$(minikube ip):8443/api -k
{
"kind": "APIVersions",
"versions": [
...
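Next, let's look at granting access to namespaced resources across all namespaces, using the built-in view ClusterRole. A rough excerpt of its first rule (see the full definition with kubectl get clusterrole view -o yaml):

rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - endpoints
  - persistentvolumeclaims
  - pods
  - services
  ...
  verbs:
  - get
  - list
  - watch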
This ClusterRole has many rules. Only the first one is shown in the listing. The rule
allows getting, listing, and watching resources like ConfigMaps, Endpoints, Persistent-
VolumeClaims, and so on. These are namespaced resources, even though you’re
looking at a ClusterRole (not a regular, namespaced Role). What exactly does this
ClusterRole do?
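Try it from the pod in the foo namespace before binding anything (a sketch; the error wording follows the same pattern as the responses you saw earlier):

/ # curl localhost:8001/api/v1/pods
User "system:serviceaccount:foo:default" cannot list pods at the cluster
scope.

/ # curl localhost:8001/api/v1/namespaces/foo/pods
User "system:serviceaccount:foo:default" cannot list pods in the namespace
"foo".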
With the first command, you’re trying to list pods across all namespaces. With the sec-
ond, you’re trying to list pods in the foo namespace. The server doesn’t allow you to
do either.
Now, let’s see what happens when you create a ClusterRoleBinding and bind it to
the pod’s ServiceAccount:
$ kubectl create clusterrolebinding view-test --clusterrole=view
➥ --serviceaccount=foo:default
clusterrolebinding "view-test" created
Okay, the pod is allowed to list pods in a different namespace. It can also retrieve pods
across all namespaces by hitting the /api/v1/pods URL path:
/ # curl localhost:8001/api/v1/pods
{
"kind": "PodList",
"apiVersion": "v1",
...
As expected, the pod can get a list of all the pods in the cluster. To summarize, com-
bining a ClusterRoleBinding with a ClusterRole referring to namespaced resources
allows the pod to access namespaced resources in any namespace, as shown in fig-
ure 12.9.
Figure 12.9 A ClusterRoleBinding and ClusterRole grants permission to resources across all namespaces.
Now, let’s see what happens if you replace the ClusterRoleBinding with a RoleBinding.
First, delete the ClusterRoleBinding:
$ kubectl delete clusterrolebinding view-test
clusterrolebinding "view-test" deleted
You now have a RoleBinding in the foo namespace, binding the default Service-
Account in that same namespace with the view ClusterRole. What can your pod
access now?
/ # curl localhost:8001/api/v1/namespaces/foo/pods
{
"kind": "PodList",
"apiVersion": "v1",
...
/ # curl localhost:8001/api/v1/namespaces/bar/pods
User "system:serviceaccount:foo:default" cannot list pods in the namespace
"bar".
/ # curl localhost:8001/api/v1/pods
User "system:serviceaccount:foo:default" cannot list pods at the cluster
scope.
As you can see, your pod can list pods in the foo namespace, but not in any other spe-
cific namespace or across all namespaces. This is visualized in figure 12.10.
Figure 12.10 A RoleBinding referring to a ClusterRole only grants access to resources inside the RoleBinding's namespace.
Table 12.2 When to use specific combinations of role and binding types
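Condensed from the preceding examples, the combinations boil down to the following (a summary standing in for the original table):

For cluster-level resources (Nodes, PersistentVolumes, and so on), use a ClusterRole and a ClusterRoleBinding.
For non-resource URLs (/api, /healthz, and so on), use a ClusterRole and a ClusterRoleBinding.
For namespaced resources in any namespace, use a ClusterRole and a ClusterRoleBinding.
For namespaced resources in a specific namespace, reusing the same ClusterRole in multiple namespaces, use a ClusterRole and a RoleBinding.
For namespaced resources in a specific namespace, with the Role defined in each such namespace, use a Role and a RoleBinding.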
Hopefully, the relationships between the four RBAC resources are much clearer
now. Don’t worry if you still feel like you don’t yet grasp everything. Things may
clear up as we explore the pre-configured ClusterRoles and ClusterRoleBindings in
the next section.
The most important roles are the view, edit, admin, and cluster-admin ClusterRoles.
They’re meant to be bound to ServiceAccounts used by user-defined pods.
ALLOWING READ-ONLY ACCESS TO RESOURCES WITH THE VIEW CLUSTERROLE
You already used the default view ClusterRole in the previous example. It allows read-
ing most resources in a namespace, except for Roles, RoleBindings, and Secrets. You’re
probably wondering, why not Secrets? Because one of those Secrets might include an
authentication token with greater privileges than those defined in the view Cluster-
Role and could allow the user to masquerade as a different user to gain additional
privileges (privilege escalation).
ALLOWING MODIFYING RESOURCES WITH THE EDIT CLUSTERROLE
Next is the edit ClusterRole, which allows you to modify resources in a namespace,
but also allows both reading and modifying Secrets. It doesn’t, however, allow viewing
or modifying Roles or RoleBindings—again, this is to prevent privilege escalation.
GRANTING FULL CONTROL OF A NAMESPACE WITH THE ADMIN CLUSTERROLE
Complete control of the resources in a namespace is granted in the admin Cluster-
Role. Subjects with this ClusterRole can read and modify any resource in the name-
space, except ResourceQuotas (we’ll learn what those are in chapter 14) and the
Namespace resource itself. The main difference between the edit and the admin Cluster-
Roles is in the ability to view and modify Roles and RoleBindings in the namespace.
NOTE To prevent privilege escalation, the API server only allows users to cre-
ate and update Roles if they already have all the permissions listed in that
Role (and for the same scope).
Although the Controller Manager runs as a single pod, each controller running
inside it can use a separate ClusterRole and ClusterRoleBinding (they’re prefixed
with system:controller:).
Each of these system ClusterRoles has a matching ClusterRoleBinding, which binds
it to the user the system component authenticates as. The system:kube-scheduler
ClusterRoleBinding, for example, assigns the identically named ClusterRole to the
system:kube-scheduler user, which is the username the Scheduler authenticates as.
12.3 Summary
This chapter has given you a foundation on how to secure the Kubernetes API server.
You learned the following:
Clients of the API server include both human users and applications running
in pods.
Applications in pods are associated with a ServiceAccount.
Both users and ServiceAccounts are associated with groups.
By default, pods run under the default ServiceAccount, which is created for
each namespace automatically.
Additional ServiceAccounts can be created manually and associated with a pod.
ServiceAccounts can be configured to allow mounting only a constrained list of
Secrets in a given pod.
A ServiceAccount can also be used to attach image pull Secrets to pods, so you
don’t need to specify the Secrets in every pod.
Roles and ClusterRoles define what actions can be performed on which resources.
RoleBindings and ClusterRoleBindings bind Roles and ClusterRoles to users,
groups, and ServiceAccounts.
Each cluster comes with default ClusterRoles and ClusterRoleBindings.
In the next chapter, you’ll learn how to protect the cluster nodes from pods and how
to isolate pods from each other by securing the network.
Securing cluster nodes
and the network
In the previous chapter, we talked about securing the API server. If an attacker
gets access to the API server, they can run whatever they like by packaging their
code into a container image and running it in a pod. But can they do any real
damage? Aren’t containers isolated from other containers and from the node
they’re running on?
Not necessarily. In this chapter, you’ll learn how to allow pods to access the
resources of the node they’re running on. You’ll also learn how to configure the
cluster so users aren’t able to do whatever they want with their pods. Then, in
the last part of the chapter, you’ll also learn how to secure the network the pods use
to communicate.
You can try running such a pod. The next listing shows an example pod manifest.
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-host-network
spec:
  hostNetwork: true                  # Using the host node's network namespace
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
After you run the pod, you can use the following command to see that it’s indeed using
the host’s network namespace (it sees all the host’s network adapters, for example).
Listing 13.2 Network interfaces in a pod using the host’s network namespace
When the Kubernetes Control Plane components are deployed as pods (such as when
you deploy your cluster with kubeadm, as explained in appendix B), you’ll find that
those pods use the hostNetwork option, effectively making them behave as if they
weren’t running inside a pod.
Figure 13.2 Difference between pods using a hostPort and pods behind a NodePort service.
It’s important to understand that if a pod is using a specific host port, only one
instance of the pod can be scheduled to each node, because two processes can’t bind
to the same host port. The Scheduler takes this into account when scheduling pods, so
it doesn’t schedule multiple pods to the same node, as shown in figure 13.3. If you
have three nodes and want to deploy four pod replicas, only three will be scheduled
(one pod will remain Pending).
Figure 13.3 If a host port is used, only a single pod instance can be scheduled to a node.
Using the host node’s namespaces in a pod 379
Let’s see how to define the hostPort in a pod’s YAML definition. The following listing
shows the YAML to run your kubia pod and bind it to the node’s port 9000.
Listing 13.3 Binding a pod to a port in the node’s port space: kubia-hostport.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kubia-hostport
spec:
  containers:
  - image: luksa/kubia
    name: kubia
    ports:
    - containerPort: 8080            # The container can be reached on port 8080 of the pod's IP.
      hostPort: 9000                 # It can also be reached on port 9000 of the node it's deployed on.
      protocol: TCP
After you create this pod, you can access it through port 9000 of the node it’s sched-
uled to. If you have multiple nodes, you’ll see you can’t access the pod through that
port on the other nodes.
NOTE If you’re trying this on GKE, you need to configure the firewall prop-
erly using gcloud compute firewall-rules, the way you did in chapter 5.
The hostPort feature is primarily used for exposing system services, which are
deployed to every node using DaemonSets. Initially, people also used it to ensure two
replicas of the same pod were never scheduled to the same node, but now you have a
better way of achieving this—it’s explained in chapter 16.
Listing 13.4 Using the host’s PID and IPC namespaces: pod-with-host-pid-and-ipc.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-host-pid-and-ipc
spec:
  hostPID: true                      # You want the pod to use the host's PID namespace.
  hostIPC: true                      # You also want the pod to use the host's IPC namespace.
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
You’ll remember that pods usually see only their own processes, but if you run this pod
and then list the processes from within its container, you’ll see all the processes run-
ning on the host node, not only the ones running in the container, as shown in the
following listing.
By setting the hostIPC property to true, processes in the pod’s containers can also
communicate with all the other processes running on the node, through Inter-Process
Communication.
Let’s see what user and group ID the container is running as, and which groups it
belongs to. You can see this by running the id command inside the container:
$ kubectl exec pod-with-defaults id
uid=0(root) gid=0(root) groups=0(root), 1(bin), 2(daemon), 3(sys), 4(adm),
6(disk), 10(wheel), 11(floppy), 20(dialout), 26(tape), 27(video)
The container is running as user ID (uid) 0, which is root, and group ID (gid) 0 (also
root). It’s also a member of multiple other groups.
NOTE What user the container runs as is specified in the container image. In
a Dockerfile, this is done using the USER directive. If omitted, the container
runs as root.
Now, you’ll run a pod where the container runs as a different user.
apiVersion: v1
kind: Pod
metadata:
  name: pod-as-user-guest
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      runAsUser: 405                 # You need to specify a user ID, not a username
                                     # (id 405 corresponds to the guest user).
Now, to see the effect of the runAsUser property, run the id command in this new
pod, the way you did before:
$ kubectl exec pod-as-user-guest id
uid=405(guest) gid=100(users)
apiVersion: v1
kind: Pod
metadata:
  name: pod-run-as-non-root
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      runAsNonRoot: true             # This container will only be allowed to run as a non-root user.
If you deploy this pod, it gets scheduled, but is not allowed to run:
$ kubectl get po pod-run-as-non-root
NAME READY STATUS
pod-run-as-non-root 0/1 container has runAsNonRoot and image will run
➥ as root
Now, if anyone tampers with your container images, they won’t get far.
An example of such a pod is the kube-proxy pod, which needs to modify the node’s
iptables rules to make services work, as was explained in chapter 11. If you follow the
instructions in appendix B and deploy a cluster with kubeadm, you’ll see every cluster
node runs a kube-proxy pod and you can examine its YAML specification to see all the
special features it’s using.
To get full access to the node’s kernel, the pod’s container runs in privileged
mode. This is achieved by setting the privileged property in the container’s security-
Context property to true. You’ll create a privileged pod from the YAML in the follow-
ing listing.
apiVersion: v1
kind: Pod
metadata:
  name: pod-privileged
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      privileged: true               # This container will run in privileged mode.
Go ahead and deploy this pod, so you can compare it with the non-privileged pod you
ran earlier.
If you’re familiar with Linux, you may know it has a special file directory called /dev,
which contains device files for all the devices on the system. These aren’t regular files on
disk, but are special files used to communicate with devices. Let’s see what devices are
visible in the non-privileged container you deployed earlier (the pod-with-defaults
pod), by listing files in its /dev directory, as shown in the following listing.
The listing shows all the devices. The list is fairly short. Now, compare this with the fol-
lowing listing, which shows the device files your privileged pod can see.
I haven’t included the whole list, because it’s too long for the book, but it’s evident
that the device list is much longer than before. In fact, the privileged container sees
all the host node’s devices. This means it can use any device freely.
For example, I had to use privileged mode like this when I wanted a pod running
on a Raspberry Pi to control LEDs connected to it.
If you want to allow the container to change the system time, you can add a capabil-
ity called CAP_SYS_TIME to the container’s capabilities list, as shown in the follow-
ing listing.
apiVersion: v1
kind: Pod
metadata:
  name: pod-add-settime-capability
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      capabilities:
        add:                         # Adding the SYS_TIME capability allows
        - SYS_TIME                   # the container to change the system time.
NOTE Linux kernel capabilities are usually prefixed with CAP_. But when
specifying them in a pod spec, you must leave out the prefix.
If you run the same command in this new pod’s container, the system time is changed
successfully:
$ kubectl exec -it pod-add-settime-capability -- date +%T -s "12:00:00"
12:00:00
WARNING If you try this yourself, be aware that it may cause your worker
node to become unusable. In Minikube, although the system time was auto-
matically reset back by the Network Time Protocol (NTP) daemon, I had to
reboot the VM to schedule new pods.
You can confirm the node’s time has been changed by checking the time on the node
running the pod. In my case, I’m using Minikube, so I have only one node and I can
get its time like this:
$ minikube ssh date
Sun May 7 12:00:07 UTC 2017
Adding capabilities like this is a much better way than giving a container full privileges
with privileged: true. Admittedly, it does require you to know and understand what
each capability does.
TIP You’ll find the list of Linux kernel capabilities in the Linux man pages.
To prevent the container from doing that, you need to drop the capability by listing it
under the container’s securityContext.capabilities.drop property, as shown in
the following listing.
apiVersion: v1
kind: Pod
metadata:
  name: pod-drop-chown-capability
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      capabilities:
        drop:                        # You're not allowing this container
        - CHOWN                      # to change file ownership.
By dropping the CHOWN capability, you’re not allowed to change the owner of the /tmp
directory in this pod:
$ kubectl exec pod-drop-chown-capability chown guest /tmp
chown: /tmp: Operation not permitted
You’re almost done exploring the container’s security context options. Let’s look at
one more.
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-readonly-filesystem
spec:
  containers:
  - name: main
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      readOnlyRootFilesystem: true   # This container's filesystem can't be written to...
    volumeMounts:
    - name: my-volume                # ...but writing to /volume is allowed,
      mountPath: /volume             # because a volume is mounted there.
      readOnly: false
  volumes:
  - name: my-volume
    emptyDir: {}
When you deploy this pod, the container is running as root, which has write permis-
sions to the / directory, but trying to write a file there fails:
$ kubectl exec -it pod-with-readonly-filesystem touch /new-file
touch: /new-file: Read-only file system
As shown in the example, when you make the container’s filesystem read-only, you’ll
probably want to mount a volume in every directory the application writes to (for
example, logs, on-disk caches, and so on).
TIP To increase security, when running pods in production, set their con-
tainer’s readOnlyRootFilesystem property to true.
SETTING SECURITY CONTEXT OPTIONS AT THE POD LEVEL
In all these examples, you’ve set the security context of an individual container. Sev-
eral of these options can also be set at the pod level (through the pod.spec.security-
Context property). They serve as a default for all the pod’s containers but can be
overridden at the container level. The pod-level security context also allows you to set
additional properties, which we’ll explain next.
under its own specific user). If those two containers use a volume to share files, they
may not necessarily be able to read or write files of one another.
That's why Kubernetes allows you to specify supplemental groups for all the containers
running in the pod, allowing them to share files regardless of the user IDs
they're running as. This is done using the following two properties:
fsGroup
supplementalGroups
What they do is best explained in an example, so let’s see how to use them in a pod
and then see what their effect is. The next listing describes a pod with two containers
sharing the same volume.
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-shared-volume-fsgroup
spec:
  securityContext:
    fsGroup: 555                     # The fsGroup and supplementalGroups are defined
    supplementalGroups: [666, 777]   # in the security context at the pod level.
  containers:
  - name: first
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      runAsUser: 1111                # The first container runs as user ID 1111.
    volumeMounts:
    - name: shared-volume            # Both containers use
      mountPath: /volume             # the same volume.
      readOnly: false
  - name: second
    image: alpine
    command: ["/bin/sleep", "999999"]
    securityContext:
      runAsUser: 2222                # The second container runs as user ID 2222.
    volumeMounts:
    - name: shared-volume
      mountPath: /volume
      readOnly: false
  volumes:
  - name: shared-volume
    emptyDir: {}
After you create this pod, run a shell in its first container and see what user and group
IDs the container is running as:
$ kubectl exec -it pod-with-shared-volume-fsgroup -c first sh
/ $ id
uid=1111 gid=0(root) groups=555,666,777
The id command shows the container is running with user ID 1111, as specified in the
pod definition. The effective group ID is 0(root), but group IDs 555, 666, and 777 are
also associated with the user.
In the pod definition, you set fsGroup to 555. Because of this, the mounted volume
will be owned by group ID 555, as shown here:
/ $ ls -l / | grep volume
drwxrwsrwx 2 root 555 6 May 29 12:23 volume
If you create a file in the mounted volume’s directory, the file is owned by user ID
1111 (that’s the user ID the container is running as) and by group ID 555:
/ $ echo foo > /volume/foo
/ $ ls -l /volume
total 4
-rw-r--r-- 1 1111 555 4 May 29 12:25 foo
This is different from how ownership is otherwise set up for newly created files. Usu-
ally, the user’s effective group ID, which is 0 in your case, is used when a user creates
files. You can see this by creating a file in the container’s filesystem instead of in the
volume:
/ $ echo foo > /tmp/foo
/ $ ls -l /tmp
total 4
-rw-r--r-- 1 1111 root 4 May 29 12:41 foo
As you can see, the fsGroup security context property is used when the process cre-
ates files in a volume (but this depends on the volume plugin used), whereas the
supplementalGroups property defines a list of additional group IDs the user is asso-
ciated with.
This concludes this section about the configuration of the container’s security con-
text. Next, we’ll see how a cluster administrator can restrict users from doing so.
Cluster administrators can restrict the use of the security-related features described
so far by creating PodSecurityPolicy resources, which are enforced by the
PodSecurityPolicy admission control plugin running in the API server (we explained
admission control plugins in chapter 11).
When someone posts a pod resource to the API server, the PodSecurityPolicy admis-
sion control plugin validates the pod definition against the configured PodSecurity-
Policies. If the pod conforms to the cluster’s policies, it’s accepted and stored into
etcd; otherwise it’s rejected immediately. The plugin may also modify the pod
resource according to defaults configured in the policy.
The API server won’t start up until you create the password file you specified in the
command line options. This is how to create the file:
$ cat <<EOF | minikube ssh sudo tee /etc/kubernetes/passwd
password,alice,1000,basic-user
password,bob,2000,privileged-user
EOF
You’ll find a shell script that runs both commands in the book’s code archive in
Chapter13/minikube-with-rbac-and-psp-enabled.sh.
Which kernel capabilities are allowed, which are added by default and which are
always dropped
What SELinux labels a container can use
Whether a container can use a writable root filesystem or not
Which filesystem groups the container can run as
Which volume types a pod can use
If you’ve read this chapter up to this point, everything but the last item in the previous
list should be familiar. The last item should also be fairly clear.
EXAMINING A SAMPLE PODSECURITYPOLICY
The following listing shows a sample PodSecurityPolicy, which prevents pods from
using the host’s IPC, PID, and Network namespaces, and prevents running privileged
containers and the use of most host ports (except ports from 10000-11000 and 13000-
14000). The policy doesn’t set any constraints on what users, groups, or SELinux
groups the container can run as.
apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name: default
spec:
  hostIPC: false                     # Containers aren't allowed to use the host's
  hostPID: false                     # IPC, PID, or network namespace.
  hostNetwork: false
  hostPorts:
  - min: 10000                       # They can only bind to host ports 10000 to 11000
    max: 11000                       # (inclusive) or host ports 13000 to 14000.
  - min: 13000
    max: 14000
  privileged: false                  # Containers cannot run in privileged mode.
  readOnlyRootFilesystem: true       # Containers are forced to run with a read-only root filesystem.
  runAsUser:
    rule: RunAsAny                   # Containers can run as any user and any group.
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny                   # They can also use any SELinux groups they want.
  volumes:
  - '*'                              # All volume types can be used in pods.
After you post this PodSecurityPolicy resource to the cluster, the API server will no longer allow you to deploy the privileged pod used earlier. For example:
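A sketch of what that attempt looks like, assuming the privileged pod's manifest from earlier is saved in pod-privileged.yaml (the exact error text varies by Kubernetes version):
$ kubectl create -f pod-privileged.yaml
Error from server (Forbidden): error when creating "pod-privileged.yaml":
pods "pod-privileged" is forbidden: unable to validate against any pod
security policy: [spec.privileged: Invalid value: true: Privileged
containers are not allowed]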
Likewise, you can no longer deploy pods that want to use the host’s PID, IPC, or Net-
work namespace. Also, because you set readOnlyRootFilesystem to true in the pol-
icy, the container filesystems in all pods will be read-only (containers can only write
to volumes).
runAsUser:
  rule: MustRunAs
  ranges:
  - min: 2               # Add a single range with min equal to max to set one specific ID.
    max: 2
fsGroup:
  rule: MustRunAs
  ranges:                # Multiple ranges are supported; here, group IDs
  - min: 2               # can be 2-10 or 20-30 (inclusive).
    max: 10
  - min: 20
    max: 30
supplementalGroups:
  rule: MustRunAs
  ranges:
  - min: 2
    max: 10
  - min: 20
    max: 30
If the pod spec tries to set either of those fields to a value outside of these ranges, the
pod will not be accepted by the API server. To try this, delete the previous PodSecurity-
Policy and create the new one from the psp-must-run-as.yaml file.
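Assuming the earlier policy is still named default, the two commands would be:
$ kubectl delete psp default
$ kubectl create -f psp-must-run-as.yaml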
NOTE Changing the policy has no effect on existing pods, because PodSecurity-
Policies are enforced only when creating or updating pods.
Okay, that was obvious. But what happens if you deploy a pod without setting the runAs-
User property, but the user ID is baked into the container image (using the USER direc-
tive in the Dockerfile)?
DEPLOYING A POD WITH A CONTAINER IMAGE WITH AN OUT-OF-RANGE USER ID
I’ve created an alternative image for the Node.js app you’ve used throughout the
book. The image is configured so that the container will run as user ID 5. The Docker-
file for the image is shown in the following listing.
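A minimal sketch of such a Dockerfile (the base image and application file are assumptions; the important part is the USER directive hardcoding user ID 5):

FROM node:7
ADD app.js /app.js
USER 5
ENTRYPOINT ["node", "app.js"]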
When you deploy a pod from this image without setting runAsUser in the pod spec, the API server accepts it and, unlike before, the Kubelet runs its container. Let's see what user ID the container is running as:
$ kubectl exec run-as-5 -- id
uid=2(bin) gid=2(bin) groups=2(bin)
As you can see, the container is running as user ID 2, which is the ID you specified in
the PodSecurityPolicy. The PodSecurityPolicy can be used to override the user ID
hardcoded into a container image.
We’ll look at an example first, and then discuss what each of the three fields does. The
following listing shows a snippet of a PodSecurityPolicy resource defining three fields
related to capabilities.
apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
spec:
  allowedCapabilities:          # Allow containers to add the SYS_TIME capability.
  - SYS_TIME
  defaultAddCapabilities:       # Automatically add the CHOWN capability to every container.
  - CHOWN
  requiredDropCapabilities:     # Require containers to drop the SYS_ADMIN
  - SYS_ADMIN                   # and SYS_MODULE capabilities.
  - SYS_MODULE
  ...
Listing 13.19 A PSP snippet allowing the use of only certain volume types:
psp-volumes.yaml
kind: PodSecurityPolicy
spec:
  volumes:
  - emptyDir
  - configMap
  - secret
  - downwardAPI
  - persistentVolumeClaim
If multiple PodSecurityPolicy resources are in place, pods can use any volume type
defined in any of the policies (the union of all volumes lists is used).
apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name: privileged           # The name of this policy is "privileged".
spec:
  privileged: true           # It allows running privileged containers.
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  volumes:
  - '*'
After you post this policy to the API server, you have two policies in the cluster:
$ kubectl get psp
NAME PRIV CAPS SELINUX RUNASUSER FSGROUP ...
default false [] RunAsAny RunAsAny RunAsAny ...
privileged true [] RunAsAny RunAsAny RunAsAny ...
As you can see in the PRIV column, the default policy doesn’t allow running privi-
leged containers, whereas the privileged policy does. Because you’re currently
logged in as a cluster-admin, you can see all the policies. When creating pods, if any
policy allows you to deploy a pod with certain features, the API server will accept
your pod.
Now imagine two additional users are using your cluster: Alice and Bob. You want
Alice to only deploy restricted (non-privileged) pods, but you want to allow Bob to
also deploy privileged pods. You do this by making sure Alice can only use the default
PodSecurityPolicy, while allowing Bob to use both.
USING RBAC TO ASSIGN DIFFERENT PODSECURITYPOLICIES TO DIFFERENT USERS
In the previous chapter, you used RBAC to grant users access to only certain resource
types, but I mentioned that access can be granted to specific resource instances by ref-
erencing them by name. That’s what you’ll use to make users use different Pod-
SecurityPolicy resources.
First, you’ll create two ClusterRoles, each allowing the use of one of the policies.
You’ll call the first one psp-default and in it allow the use of the default Pod-
SecurityPolicy resource. You can use kubectl create clusterrole to do that:
$ kubectl create clusterrole psp-default --verb=use
➥ --resource=podsecuritypolicies --resource-name=default
clusterrole "psp-default" created
NOTE You’re using the special verb use instead of get, list, watch, or similar.
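The second ClusterRole, which you might call psp-privileged, allows the use of the privileged policy and is created the same way:
$ kubectl create clusterrole psp-privileged --verb=use
➥ --resource=podsecuritypolicies --resource-name=privileged
clusterrole "psp-privileged" created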
Now, you need to bind these two policies to users. As you may remember from the pre-
vious chapter, if you’re binding a ClusterRole that grants access to cluster-level
resources (which is what PodSecurityPolicy resources are), you need to use a Cluster-
RoleBinding instead of a (namespaced) RoleBinding.
You’re going to bind the psp-default ClusterRole to all authenticated users, not
only to Alice. This is necessary because otherwise no one could create any pods,
because the Admission Control plugin would complain that no policy is in place.
Authenticated users all belong to the system:authenticated group, so you’ll bind
the ClusterRole to the group:
$ kubectl create clusterrolebinding psp-all-users
➥ --clusterrole=psp-default --group=system:authenticated
clusterrolebinding "psp-all-users" created
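Bob, on the other hand, gets the psp-privileged ClusterRole bound to his user specifically; a sketch of that binding (the binding name is arbitrary):
$ kubectl create clusterrolebinding psp-bob
➥ --clusterrole=psp-privileged --user=bob
clusterrolebinding "psp-bob" created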
As an authenticated user, Alice should now have access to the default PodSecurity-
Policy, whereas Bob should have access to both the default and the privileged Pod-
SecurityPolicies. Alice shouldn’t be able to create privileged pods, whereas Bob
should. Let’s see if that’s true.
CREATING ADDITIONAL USERS FOR KUBECTL
But how do you authenticate as Alice or Bob instead of whatever you’re authenticated
as currently? The book’s appendix A explains how kubectl can be used with multiple
clusters, but also with multiple contexts. A context includes the user credentials used
for talking to a cluster. Turn to appendix A to find out more. Here we’ll show the bare
commands enabling you to use kubectl as Alice or Bob.
First, you’ll create two new users in kubectl’s config with the following two
commands:
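A sketch of those commands (the passwords match the ones in the password file created earlier):
$ kubectl config set-credentials alice --username=alice --password=password
User "alice" set.
$ kubectl config set-credentials bob --username=bob --password=password
User "bob" set.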
It should be obvious what the commands do. Because you’re setting username and
password credentials, kubectl will use basic HTTP authentication for these two users
(other authentication methods include tokens, client certificates, and so on).
CREATING PODS AS A DIFFERENT USER
You can now try creating a privileged pod while authenticating as Alice. You can tell
kubectl which user credentials to use by using the --user option:
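For example, assuming the privileged pod's manifest from earlier is in pod-privileged.yaml (the error text is only representative):
$ kubectl --user alice create -f pod-privileged.yaml
Error from server (Forbidden): error when creating "pod-privileged.yaml":
pods "pod-privileged" is forbidden: unable to validate against any pod
security policy: ...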
As expected, the API server doesn’t allow Alice to create privileged pods. Now, let’s see
if it allows Bob to do that:
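Using the same manifest:
$ kubectl --user bob create -f pod-privileged.yaml
pod "pod-privileged" created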
And there you go. You’ve successfully used RBAC to make the Admission Control
plugin use different PodSecurityPolicy resources for different users.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector:          # An empty pod selector matches all pods in the same namespace.
When you create this NetworkPolicy in a certain namespace, no one can connect to
any pod in that namespace.
NOTE The CNI plugin or other type of networking solution used in the clus-
ter must support NetworkPolicy, or else there will be no effect on inter-pod
connectivity.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-netpolicy
spec:
  podSelector:            # This policy secures access to pods with the app=database label.
    matchLabels:
      app: database
  ingress:
  - from:
    - podSelector:        # It allows incoming connections only from pods
        matchLabels:      # with the app=webserver label.
          app: webserver
    ports:
    - port: 5432          # Connections to this port are allowed.
The example NetworkPolicy allows pods with the app=webserver label to connect to
pods with the app=database label, and only on port 5432. Other pods can’t connect to
the database pods, and no one (not even the webserver pods) can connect to anything
other than port 5432 of the database pods. This is shown in figure 13.4.
Client pods usually connect to server pods through a Service instead of directly to
the pod, but that doesn’t change anything. The NetworkPolicy is enforced when con-
necting through a Service, as well.
Figure 13.4 A NetworkPolicy allowing only some pods to access other pods and only on a specific port
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: shoppingcart-netpolicy
spec:
  podSelector:                # This policy applies to pods labeled app=shopping-cart.
    matchLabels:
      app: shopping-cart
  ingress:
  - from:
    - namespaceSelector:      # Only pods running in namespaces labeled tenant=manning
        matchLabels:          # are allowed to access the microservice.
          tenant: manning
    ports:
    - port: 80
Figure 13.5 A NetworkPolicy only allowing pods in namespaces matching a namespaceSelector to access a specific pod
If the shopping cart provider also wants to give access to other tenants (perhaps to
one of their partner companies), they can either create an additional NetworkPolicy
resource or add an additional ingress rule to their existing NetworkPolicy.
ingress:
- from:
  - ipBlock:                  # This ingress rule only allows traffic from
      cidr: 192.168.1.0/24    # clients in the 192.168.1.0/24 IP block.
spec:
  podSelector:            # This policy applies to pods with the app=webserver label.
    matchLabels:
      app: webserver
  egress:                 # It limits the pods' outbound traffic.
  - to:
    - podSelector:        # Webserver pods may only connect to pods
        matchLabels:      # with the app=database label.
          app: database
The NetworkPolicy in the previous listing allows pods that have the app=webserver
label to only access pods that have the app=database label and nothing else (neither
other pods, nor any other IP, regardless of whether it’s internal or external to the
cluster).
13.5 Summary
In this chapter, you learned about securing cluster nodes from pods and pods from
other pods. You learned that
■ Pods can use the node's Linux namespaces instead of using their own.
■ Containers can be configured to run as a different user and/or group than the one defined in the container image.
■ Containers can also run in privileged mode, allowing them to access the node's devices that are otherwise not exposed to pods.
■ Containers can be run as read-only, preventing processes from writing to the container's filesystem (and only allowing them to write to mounted volumes).
■ Cluster-level PodSecurityPolicy resources can be created to prevent users from creating pods that could compromise a node.
■ PodSecurityPolicy resources can be associated with specific users using RBAC's ClusterRoles and ClusterRoleBindings.
■ NetworkPolicy resources are used to limit a pod's inbound and/or outbound traffic.
In the next chapter, you’ll learn how computational resources available to pods can be
constrained and how a pod’s quality of service is configured.
Managing pods’
computational resources
Up to now you’ve created pods without caring about how much CPU and memory
they’re allowed to consume. But as you’ll see in this chapter, setting both how
much a pod is expected to consume and the maximum amount it’s allowed to con-
sume is a vital part of any pod definition. Setting these two sets of parameters
makes sure that a pod takes only its fair share of the resources provided by the
Kubernetes cluster and also affects how pods are scheduled across the cluster.
apiVersion: v1
kind: Pod
metadata:
  name: requests-pod
spec:
  containers:
  - image: busybox
    command: ["dd", "if=/dev/zero", "of=/dev/null"]
    name: main
    resources:
      requests:           # You're specifying resource requests for the main container.
        cpu: 200m         # The container requests 200 millicores (that is, 1/5 of a single CPU core's time).
        memory: 10Mi      # The container also requests 10 mebibytes of memory.
In the pod manifest, your single container requires one-fifth of a CPU core (200 mil-
licores) to run properly. Five such pods/containers can run sufficiently fast on a single
CPU core.
When you don’t specify a request for CPU, you’re saying you don’t care how much
CPU time the process running in your container is allotted. In the worst case, it may
not get any CPU time at all (this happens when a heavy demand by other processes
exists on the CPU). Although this may be fine for low-priority batch jobs, which aren’t
time-critical, it obviously isn’t appropriate for containers handling user requests.
In the pod spec, you’re also requesting 10 mebibytes of memory for the container.
By doing that, you’re saying that you expect the processes running inside the con-
tainer to use at most 10 mebibytes of RAM. They might use less, but you’re not expect-
ing them to use more than that in normal circumstances. Later in this chapter you’ll
see what happens if they do.
Now you’ll run the pod. When the pod starts, you can take a quick look at the pro-
cess’ CPU consumption by running the top command inside the container, as shown
in the following listing.
Listing 14.2 Examining CPU and memory usage from within a container
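You can run top through kubectl exec; the numbers it reports depend entirely on your node:
$ kubectl exec -it requests-pod -- top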
The dd command you’re running in the container consumes as much CPU as it can,
but it only runs a single thread so it can only use a single core. The Minikube VM,
which is where this example is running, has two CPU cores allotted to it. That’s why
the process is shown consuming 50% of the whole CPU.
Fifty percent of two cores is obviously one whole core, which means the container
is using more than the 200 millicores you requested in the pod specification. This is
expected, because requests don’t limit the amount of CPU a container can use. You’d
need to specify a CPU limit to do that. You’ll try that later, but first, let’s see how spec-
ifying resource requests in a pod affects the scheduling of the pod.
Figure 14.1 The Scheduler only cares about requests, not actual usage.
UNDERSTANDING HOW THE SCHEDULER USES PODS’ REQUESTS WHEN SELECTING THE BEST NODE
FOR A POD
You may remember from chapter 11 that the Scheduler first filters the list of nodes to
exclude those that the pod can’t fit on and then prioritizes the remaining nodes per the
configured prioritization functions. Among others, two prioritization functions rank
nodes based on the amount of resources requested: LeastRequestedPriority and
MostRequestedPriority. The first one prefers nodes with fewer requested resources
(with a greater amount of unallocated resources), whereas the second one is the exact
opposite—it prefers nodes that have the most requested resources (a smaller amount of
unallocated CPU and memory). But, as we’ve discussed, they both consider the amount
of requested resources, not the amount of resources actually consumed.
The Scheduler is configured to use only one of those functions. You may wonder
why anyone would want to use the MostRequestedPriority function. After all, if you
have a set of nodes, you usually want to spread CPU load evenly across them. However,
that’s not the case when running on cloud infrastructure, where you can add and
remove nodes whenever necessary. By configuring the Scheduler to use the Most-
RequestedPriority function, you guarantee that Kubernetes will use the smallest pos-
sible number of nodes while still providing each pod with the amount of CPU/memory
it requests. By keeping pods tightly packed, certain nodes are left vacant and can be
removed. Because you’re paying for individual nodes, this saves you money.
INSPECTING A NODE’S CAPACITY
Let’s see the Scheduler in action. You’ll deploy another pod with four times the
amount of requested resources as before. But before you do that, let’s see your node’s
capacity. Because the Scheduler needs to know how much CPU and memory each
node has, the Kubelet reports this data to the API server, making it available through
the Node resource. You can see it by using the kubectl describe command as in the
following listing.
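A trimmed-down example of that output (the exact values depend on your node; these are only representative for a two-core Minikube VM):
$ kubectl describe nodes
Name:       minikube
...
Capacity:
 cpu:       2
 memory:    2048484Ki
 pods:      110
Allocatable:
 cpu:       2
 memory:    1946084Ki
 pods:      110
...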
The output shows two sets of amounts related to the available resources on the node:
the node’s capacity and allocatable resources. The capacity represents the total resources
of a node, which may not all be available to pods. Certain resources may be reserved
for Kubernetes and/or system components. The Scheduler bases its decisions only on
the allocatable resource amounts.
In the previous example, the node called minikube runs in a VM with two cores
and has no CPU reserved, making the whole CPU allocatable to pods. Therefore,
the Scheduler should have no problem scheduling another pod requesting 800
millicores.
Run the pod now. You can use the YAML file in the code archive, or run the pod
with the kubectl run command like this:
$ kubectl run requests-pod-2 --image=busybox --restart Never
➥ --requests='cpu=800m,memory=20Mi' -- dd if=/dev/zero of=/dev/null
pod "requests-pod-2" created
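Next, deploy a third pod, this time requesting a whole CPU core. The following command is a sketch modeled on the previous one (the memory request value is an assumption):
$ kubectl run requests-pod-3 --image=busybox --restart Never
➥ --requests='cpu=1,memory=20Mi' -- dd if=/dev/zero of=/dev/null
pod "requests-pod-3" created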
NOTE This time you’re specifying the CPU request in whole cores (cpu=1)
instead of millicores (cpu=1000m).
So far, so good. The pod has been accepted by the API server (you’ll remember from
the previous chapter that the API server can reject pods if they’re invalid in any way).
Now, check if the pod is running:
$ kubectl get po requests-pod-3
NAME READY STATUS RESTARTS AGE
requests-pod-3 0/1 Pending 0 4m
Even if you wait a while, the pod is still stuck at Pending. You can see more informa-
tion on why that’s the case by using the kubectl describe command, as shown in
the following listing.
Listing 14.4 Examining why a pod is stuck at Pending with kubectl describe pod
The output shows that the pod hasn’t been scheduled because it can’t fit on any node
due to insufficient CPU on your single node. But why is that? The sum of the CPU
requests of all three pods equals 2,000 millicores or exactly two cores, which is exactly
what your node can provide. What’s wrong?
DETERMINING WHY A POD ISN’T BEING SCHEDULED
You can figure out why the pod isn’t being scheduled by inspecting the node resource.
Use the kubectl describe node command again and examine the output more
closely in the following listing.
Listing 14.5 Inspecting allocated resources on a node with kubectl describe node
If you look at the bottom left of the listing, you’ll see a total of 1,275 millicores have
been requested by the running pods, which is 275 millicores more than what you
requested for the first two pods you deployed. Something is eating up additional
CPU resources.
You can find the culprit in the list of pods in the previous listing. Three pods in the
kube-system namespace have explicitly requested CPU resources. Those pods plus
your two pods leave only 725 millicores available for additional pods. Because your
third pod requested 1,000 millicores, the Scheduler won’t schedule it to this node, as
that would make the node overcommitted.
FREEING RESOURCES TO GET THE POD SCHEDULED
The pod will only be scheduled when an adequate amount of CPU is freed (when one
of the first two pods is deleted, for example). If you delete your second pod, the
Scheduler will be notified of the deletion (through the watch mechanism described in
chapter 11) and will schedule your third pod as soon as the second pod terminates.
This is shown in the following listing.
$ kubectl get po
NAME READY STATUS RESTARTS AGE
requests-pod 1/1 Running 0 2h
requests-pod-2 1/1 Terminating 0 1h
requests-pod-3 0/1 Pending 0 1h
$ kubectl get po
NAME READY STATUS RESTARTS AGE
requests-pod 1/1 Running 0 2h
requests-pod-3 1/1 Running 0 1h
In all these examples, you’ve specified a request for memory, but it hasn’t played any
role in the scheduling because your node has more than enough allocatable memory to
accommodate all your pods’ requests. Both CPU and memory requests are treated the
same way by the Scheduler, but in contrast to memory requests, a pod’s CPU requests
also play a role elsewhere—while the pod is running. You’ll learn about this next.
Figure 14.2 Unused CPU time is distributed to containers based on their CPU requests.
But if one container wants to use up as much CPU as it can, while the other one is sit-
ting idle at a given moment, the first container will be allowed to use the whole CPU
time (minus the small amount of time used by the second container, if any). After all,
it makes sense to use all the available CPU if no one else is using it, right? As soon as
the second container needs CPU time, it will get it and the first container will be throt-
tled back.
First, you obviously need to make Kubernetes aware of your custom resource by
adding it to the Node object’s capacity field. This can be done by performing a
PATCH HTTP request. The resource name can be anything, such as example.org/my-
resource, as long as it doesn’t start with the kubernetes.io domain. The quantity
must be an integer (for example, you can’t set it to 100 millis, because 0.1 isn’t an inte-
ger; but you can set it to 1000m or 2000m or, simply, 1 or 2). The value will be copied
from the capacity to the allocatable field automatically.
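For illustration, here's a sketch of such a PATCH request, performed through kubectl proxy against a node called minikube (the resource name and quantity are arbitrary; the ~1 in the path is how a slash is escaped in a JSON patch):
$ kubectl proxy &
$ curl --header "Content-Type: application/json-patch+json" --request PATCH
➥ --data '[{"op": "add", "path": "/status/capacity/example.org~1my-resource",
➥ "value": "3"}]' http://localhost:8001/api/v1/nodes/minikube/status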
Then, when creating pods, you specify the same resource name and the requested
quantity under the resources.requests field in the container spec or with --requests
when using kubectl run like you did in previous examples. The Scheduler will make
sure the pod is only deployed to a node that has the requested amount of the custom
resource available. Every deployed pod obviously reduces the number of allocatable
units of the resource.
An example of a custom resource could be the number of GPU units available on the
node. Pods requiring the use of a GPU specify that in their requests. The Scheduler then
makes sure the pod is only scheduled to nodes with at least one GPU still unallocated.
14.2.1 Setting a hard limit for the amount of resources a container can use
We’ve seen how containers are allowed to use up all the CPU if all the other processes
are sitting idle. But you may want to prevent certain containers from using up more
than a specific amount of CPU. And you’ll always want to limit the amount of memory
a container can consume.
CPU is a compressible resource, which means the amount used by a container can
be throttled without affecting the process running in the container in an adverse way.
Memory is obviously different—it’s incompressible. Once a process is given a chunk of
memory, that memory can’t be taken away from it until it’s released by the process
itself. That’s why you need to limit the maximum amount of memory a container can
be given.
Without limiting memory, a container (or a pod) running on a worker node may
eat up all the available memory and affect all other pods on the node and any new
pods scheduled to the node (remember that new pods are scheduled to the node
based on the memory requests and not actual memory usage). A single malfunction-
ing or malicious pod can practically make the whole node unusable.
CREATING A POD WITH RESOURCE LIMITS
To prevent this from happening, Kubernetes allows you to specify resource limits for
every container (along with, and virtually in the same way as, resource requests). The
following listing shows an example pod manifest with resource limits.
Listing 14.7 A pod with a hard limit on CPU and memory: limited-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: limited-pod
spec:
  containers:
  - image: busybox
    command: ["dd", "if=/dev/zero", "of=/dev/null"]
    name: main
    resources:
      limits:             # Specifying resource limits for the container
        cpu: 1            # This container will be allowed to use at most 1 CPU core.
        memory: 20Mi      # The container will be allowed to use up to 20 mebibytes of memory.
This pod’s container has resource limits configured for both CPU and memory. The
process or processes running inside the container will not be allowed to consume
more than 1 CPU core and 20 mebibytes of memory.
NOTE Because you haven’t specified any resource requests, they’ll be set to
the same values as the resource limits.
OVERCOMMITTING LIMITS
Unlike resource requests, resource limits aren’t constrained by the node’s allocatable
resource amounts. The sum of all limits of all the pods on a node is allowed to exceed
100% of the node’s capacity (figure 14.3). Restated, resource limits can be overcom-
mitted. This has an important consequence—when 100% of the node’s resources are
used up, certain containers will need to be killed.
Figure 14.3 The sum of resource limits of all pods on a node can exceed 100% of the node's capacity.
You’ll see how Kubernetes decides which containers to kill in section 14.3, but individ-
ual containers can be killed even if they try to use more than their resource limits
specify. You’ll learn more about this next.
The CrashLoopBackOff status doesn’t mean the Kubelet has given up. It means that
after each crash, the Kubelet is increasing the time period before restarting the con-
tainer. After the first crash, it restarts the container immediately and then, if it crashes
again, waits for 10 seconds before restarting it again. On subsequent crashes, this
delay is then increased exponentially to 20, 40, 80, and 160 seconds, and finally lim-
ited to 300 seconds. Once the interval hits the 300-second limit, the Kubelet keeps
restarting the container indefinitely every five minutes until the pod either stops
crashing or is deleted.
To examine why the container crashed, you can check the pod’s log and/or use
the kubectl describe pod command, as shown in the following listing.
Listing 14.8 Inspecting why a container terminated with kubectl describe pod
The OOMKilled status tells you that the container was killed because it was out of mem-
ory. In the previous listing, the container went over its memory limit and was killed
immediately.
It’s important not to set memory limits too low if you don’t want your container to
be killed. But containers can get OOMKilled even if they aren’t over their limit. You’ll
see why in section 14.3.2, but first, let’s discuss something that catches most users off-
guard the first time they start specifying limits for their containers.
Now, run the top command in the container, the way you did at the beginning of the
chapter. The command’s output is shown in the following listing.
Listing 14.9 Running the top command in a CPU- and memory-limited container
First, let me remind you that the pod’s CPU limit is set to 1 core and its memory limit
is set to 20 MiB. Now, examine the output of the top command closely. Is there any-
thing that strikes you as odd?
Look at the amount of used and free memory. Those numbers are nowhere near
the 20 MiB you set as the limit for the container. Similarly, you set the CPU limit to
one core and it seems like the main process is using only 50% of the available CPU
time, even though the dd command, when used like you’re using it, usually uses all the
CPU it has available. What’s going on?
UNDERSTANDING THAT CONTAINERS ALWAYS SEE THE NODE’S MEMORY, NOT THE CONTAINER’S
The top command shows the memory amounts of the whole node the container is
running on. Even though you set a limit on how much memory is available to a con-
tainer, the container will not be aware of this limit.
This has an unfortunate effect on any application that looks up the amount of
memory available on the system and uses that information to decide how much mem-
ory it wants to reserve.
The problem is visible when running Java apps, especially if you don’t specify the
maximum heap size for the Java Virtual Machine with the -Xmx option. In that case,
the JVM will set the maximum heap size based on the host’s total memory instead of
the memory available to the container. When you run your containerized Java apps in
a Kubernetes cluster on your laptop, the problem doesn’t manifest itself, because the
difference between the memory limits you set for the pod and the total memory avail-
able on your laptop is not that great.
But when you deploy your pod onto a production system, where nodes have much
more physical memory, the JVM may go over the container’s memory limit you config-
ured and will be OOMKilled.
And if you think setting the -Xmx option properly solves the issue, you’re wrong,
unfortunately. The -Xmx option only constrains the heap size, but does nothing about
the JVM’s off-heap memory. Luckily, new versions of Java alleviate that problem by tak-
ing the configured container limits into account.
UNDERSTANDING THAT CONTAINERS ALSO SEE ALL THE NODE’S CPU CORES
Exactly like with memory, containers will also see all the node’s CPUs, regardless of
the CPU limits configured for the container. Setting a CPU limit to one core doesn’t
magically expose only one CPU core to the container. All the CPU limit does is
constrain the amount of CPU time the container can use.
A container with a one-core CPU limit running on a 64-core CPU will get 1/64th
of the overall CPU time. And even though its limit is set to one core, the container’s
processes will not run on only one core. At different points in time, its code may be
executed on different cores.
Nothing is wrong with this, right? While that’s generally the case, at least one sce-
nario exists where this situation is catastrophic.
Certain applications look up the number of CPUs on the system to decide how
many worker threads they should run. Again, such an app will run fine on a develop-
ment laptop, but when deployed on a node with a much bigger number of cores, it’s
going to spin up too many threads, all competing for the (possibly) limited CPU time.
Also, each thread requires additional memory, causing the app's memory usage to skyrocket.
You may want to use the Downward API to pass the CPU limit to the container and
use it instead of relying on the number of CPUs your app can see on the system. You
can also tap into the cgroups system directly to get the configured CPU limit by read-
ing the following files:
/sys/fs/cgroup/cpu/cpu.cfs_quota_us
/sys/fs/cgroup/cpu/cpu.cfs_period_us
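For example, inside a container with a one-core CPU limit, the two files would typically read like this (the quota equals the period, meaning one full core per scheduling period):
/ $ cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
100000
/ $ cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000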
Specifying only limits (for both CPU and memory, in every container) is enough for the pod to be Guaranteed, because requests then default to the limits. Containers in those pods get the requested amount of resources, but cannot consume additional ones (because their limits are no higher than their requests).
ASSIGNING THE BURSTABLE QOS CLASS TO A POD
In between BestEffort and Guaranteed is the Burstable QoS class. All other pods
fall into this class. This includes single-container pods where the container’s limits
don’t match its requests and all pods where at least one container has a resource
request specified, but not the limit. It also includes pods where one container’s
requests match their limits, but another container has no requests or limits specified.
Burstable pods get the amount of resources they request, but are allowed to use addi-
tional resources (up to the limit) if needed.
UNDERSTANDING HOW THE RELATIONSHIP BETWEEN REQUESTS AND LIMITS DEFINES THE QOS CLASS
All three QoS classes and their relationships with requests and limits are shown in fig-
ure 14.4.
Thinking about what QoS class a pod has can make your head spin, because it involves
multiple containers, multiple resources, and all the possible relationships between
requests and limits. It’s easier if you start by thinking about QoS at the container level
(although QoS classes are a property of pods, not containers) and then derive the
pod’s QoS class from the QoS classes of containers.
FIGURING OUT A CONTAINER’S QOS CLASS
Table 14.1 shows the QoS class based on how resource requests and limits are
defined on a single container. For single-container pods, the QoS class applies to
the pod as well.
Table 14.1 The QoS class of a single-container pod based on resource requests and limits

CPU requests vs. limits    Memory requests vs. limits    Container QoS class
None set                   None set                      BestEffort
None set                   Requests < limits             Burstable
None set                   Requests = limits             Burstable
Requests < limits          None set                      Burstable
Requests < limits          Requests < limits             Burstable
Requests < limits          Requests = limits             Burstable
Requests = limits          Requests = limits             Guaranteed
NOTE If only requests are set, but not limits, refer to the table rows where
requests are less than the limits. If only limits are set, requests default to the
limits, so refer to the rows where requests equal limits.
FIGURING OUT THE QOS CLASS OF A POD WITH MULTIPLE CONTAINERS
For multi-container pods, if all the containers have the same QoS class, that’s also the
pod’s QoS class. If at least one container has a different class, the pod’s QoS class is
Burstable, regardless of what the container classes are. Table 14.2 shows how a two-
container pod’s QoS class relates to the classes of its two containers. You can easily
extend this to pods with more than two containers.
Table 14.2 A Pod's QoS class derived from the classes of its containers

Container 1 QoS class    Container 2 QoS class    Pod's QoS class
BestEffort               BestEffort               BestEffort
BestEffort               Burstable                Burstable
BestEffort               Guaranteed               Burstable
Burstable                Burstable                Burstable
Burstable                Guaranteed               Burstable
Guaranteed               Guaranteed               Guaranteed
NOTE A pod’s QoS class is shown when running kubectl describe pod and
in the pod’s YAML/JSON manifest in the status.qosClass field.
We’ve explained how QoS classes are determined, but we still need to look at how they
determine which container gets killed in an overcommitted system.
Figure 14.5 Four pods shown with their memory requests, limits, and actual usage (90%, 70%, and 99% of the requested memory in use)
Obviously, a BestEffort pod’s process will also be killed before any Guaranteed pods’
processes are killed. Likewise, a Burstable pod’s process will also be killed before that
of a Guaranteed pod. But what happens if there are only two Burstable pods? Clearly,
the selection process needs to prefer one over the other.
UNDERSTANDING HOW CONTAINERS WITH THE SAME QOS CLASS ARE HANDLED
Each running process has an OutOfMemory (OOM) score. The system selects the
process to kill by comparing OOM scores of all the running processes. When memory
needs to be freed, the process with the highest score gets killed.
OOM scores are calculated from two things: the percentage of the available mem-
ory the process is consuming and a fixed OOM score adjustment, which is based on the
pod’s QoS class and the container’s requested memory. When two single-container pods
exist, both in the Burstable class, the system will kill the one using more of its requested
memory than the other, percentage-wise. That’s why in figure 14.5, pod B, using 90%
of its requested memory, gets killed before pod C, which is only using 70%, even
though it’s using more megabytes of memory than pod B.
This shows you need to be mindful of not only the relationship between requests
and limits, but also of requests and the expected actual memory consumption.
Figure 14.6 A LimitRange (with min/max CPU and memory and default requests and limits) is used by the API server to validate incoming pod manifests: pods whose requests or limits fall outside the min/max values are rejected, and defaults are applied to pods that don't specify them.
LimitRange resources are used by the LimitRanger Admission Control plugin (we
explained what those plugins are in chapter 11). When a pod manifest is posted to the
API server, the LimitRanger plugin validates the pod spec. If validation fails, the mani-
fest is rejected immediately. Because of this, a great use-case for LimitRange objects is
to prevent users from creating pods that are bigger than any node in the cluster. With-
out such a LimitRange, the API server will gladly accept the pod, but then never
schedule it.
The limits specified in a LimitRange resource apply to each individual pod/con-
tainer or other kind of object created in the same namespace as the LimitRange
object. They don’t limit the total amount of resources available across all the pods in
the namespace. This is specified through ResourceQuota objects, which are explained
in section 14.5.
apiVersion: v1
kind: LimitRange
metadata:
  name: example
spec:
  limits:
  - type: Pod                  # Specifies the limits for a pod as a whole
    min:
      cpu: 50m                 # Minimum CPU and memory all the pod's containers can request in total
      memory: 5Mi
    max:
      cpu: 1                   # Maximum CPU and memory all the pod's containers can request (and limit)
      memory: 1Gi
  - type: Container            # The container limits are specified below.
    defaultRequest:
      cpu: 100m                # Default requests for CPU and memory that will be applied
      memory: 10Mi             # to containers that don't specify them explicitly
    default:
      cpu: 200m                # Default limits for containers that don't specify them
      memory: 100Mi
    min:
      cpu: 50m                 # Minimum and maximum requests/limits
      memory: 5Mi              # that a container can have
    max:
      cpu: 1
      memory: 1Gi
    maxLimitRequestRatio:
      cpu: 4                   # Maximum ratio between the limit and request for each resource
      memory: 10
  - type: PersistentVolumeClaim
    min:
      storage: 1Gi             # A LimitRange can also set the minimum and maximum
    max:                       # amount of storage a PVC can request.
      storage: 10Gi
As you can see from the previous example, the minimum and maximum limits for a
whole pod can be configured. They apply to the sum of all the pod’s containers’
requests and limits.
Lower down, at the container level, you can set not only the minimum and maxi-
mum, but also default resource requests (defaultRequest) and default limits
(default) that will be applied to each container that doesn’t specify them explicitly.
Besides the min, max, and default values, you can even set the maximum ratio of
limits vs. requests. The previous listing sets the CPU maxLimitRequestRatio to 4,
which means a container’s CPU limits will not be allowed to be more than four times
greater than its CPU requests. A container requesting 200 millicores will not be
accepted if its CPU limit is set to 801 millicores or higher. For memory, the maximum
ratio is set to 10.
In chapter 6 we looked at PersistentVolumeClaims (PVC), which allow you to claim
a certain amount of persistent storage similarly to how a pod’s containers claim CPU
and memory. In the same way you’re limiting the minimum and maximum amount of
CPU a container can request, you should also limit the amount of storage a single
PVC can request. A LimitRange object allows you to do that as well, as you can see at
the bottom of the example.
The example shows a single LimitRange object containing limits for everything,
but you could also split them into multiple objects if you prefer to have them orga-
nized per type (one for pod limits, another for container limits, and yet another for
PVCs, for example). Limits from multiple LimitRange objects are all consolidated
when validating a pod or PVC.
Because the validation (and defaults) configured in a LimitRange object is per-
formed by the API server when it receives a new pod or PVC manifest, if you modify
the limits afterwards, existing pods and PVCs will not be revalidated—the new limits
will only apply to pods and PVCs created afterward.
Listing 14.11 A pod with CPU requests greater than the limit: limits-pod-too-big.yaml
resources:
  requests:
    cpu: 2
The pod’s single container is requesting two CPUs, which is more than the maximum
you set in the LimitRange earlier. Creating the pod yields the following result:
$ kubectl create -f limits-pod-too-big.yaml
Error from server (Forbidden): error when creating "limits-pod-too-big.yaml":
pods "too-big" is forbidden: [
maximum cpu usage per Pod is 1, but request is 2.,
maximum cpu usage per Container is 1, but request is 2.]
I’ve modified the output slightly to make it more legible. The nice thing about the
error message from the server is that it lists all the reasons why the pod was rejected,
not only the first one it encountered. As you can see, the pod was rejected for two rea-
sons: you requested two CPUs for the container, but the maximum CPU limit for a
container is one. Likewise, the pod as a whole requested two CPUs, but the maximum
is one CPU (if this was a multi-container pod, even if each individual container
requested less than the maximum amount of CPU, together they’d still need to
request less than two CPUs to pass the maximum CPU for pods).
Before you set up your LimitRange object, all your pods were created without any
resource requests or limits, but now the defaults are applied automatically when creat-
ing the pod. You can confirm this by describing the kubia-manual pod, as shown in
the following listing.
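A trimmed excerpt of what you'd expect to see, given the defaults in the LimitRange above (a sketch, not verbatim output):
$ kubectl describe po kubia-manual
Name:        kubia-manual
...
    Limits:
      cpu:     200m
      memory:  100Mi
    Requests:
      cpu:     100m
      memory:  10Mi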
The container’s requests and limits match the ones you specified in the LimitRange
object. If you used a different LimitRange specification in another namespace, pods
created in that namespace would obviously have different requests and limits. This
allows admins to configure default, min, and max resources for pods per namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cpu-and-mem
spec:
  hard:
    requests.cpu: 400m
    requests.memory: 200Mi
    limits.cpu: 600m
    limits.memory: 500Mi
Instead of defining a single total for each resource, you define separate totals for
requests and limits for both CPU and memory. You’ll notice the structure is a bit dif-
ferent, compared to that of a LimitRange. Here, both the requests and the limits for
all resources are defined in a single place.
This ResourceQuota sets the maximum amount of CPU pods in the namespace
can request to 400 millicores. The maximum total CPU limits in the namespace are
set to 600 millicores. For memory, the maximum total requests are set to 200 MiB,
whereas the limits are set to 500 MiB.
A ResourceQuota object applies to the namespace it’s created in, like a Limit-
Range, but it applies to all the pods’ resource requests and limits in total and not to
each individual pod or container separately, as shown in figure 14.7.
Figure 14.7 LimitRanges apply to individual pods; ResourceQuotas apply to all pods in the
namespace.
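You can inspect the quota and its current usage with kubectl describe quota. Assuming only the kubia-manual pod (with the defaulted requests and limits) is running, the output would look roughly like this:
$ kubectl describe quota
Name:            cpu-and-mem
Namespace:       default
Resource         Used   Hard
--------         ----   ----
limits.cpu       200m   600m
limits.memory    100Mi  500Mi
requests.cpu     100m   400m
requests.memory  10Mi   200Mi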
I only have the kubia-manual pod running, so the Used column matches its resource
requests and limits. When I run additional pods, their requests and limits are added to
the used amounts.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage
spec:
  hard:
    requests.storage: 500Gi                                      # The amount of storage claimable overall
    ssd.storageclass.storage.k8s.io/requests.storage: 300Gi      # The amount of claimable storage in StorageClass ssd
    standard.storageclass.storage.k8s.io/requests.storage: 1Ti   # The amount of claimable storage in StorageClass standard
A ResourceQuota can also limit the number of objects that users can create in a namespace; for example, the number of public IPs or node ports Services can use.
The following listing shows what a ResourceQuota object that limits the number of
objects may look like.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: objects
spec:
  hard:
    pods: 10                       # Only 10 Pods, 5 ReplicationControllers, 10 Secrets,
    replicationcontrollers: 5      # 10 ConfigMaps, and 4 PersistentVolumeClaims can be
    secrets: 10                    # created in the namespace.
    configmaps: 10
    persistentvolumeclaims: 4
    services: 5                    # Five Services overall can be created, of which at most
    services.loadbalancers: 1      # one can be a LoadBalancer Service and at most two
    services.nodeports: 2          # can be NodePort Services.
    ssd.storageclass.storage.k8s.io/persistentvolumeclaims: 2
The ResourceQuota in this listing allows users to create at most 10 Pods in the name-
space, regardless if they’re created manually or by a ReplicationController, Replica-
Set, DaemonSet, Job, and so on. It also limits the number of ReplicationControllers to
five. A maximum of five Services can be created, of which only one can be a LoadBal-
ancer-type Service, and only two can be NodePort Services. Similar to how the maxi-
mum amount of requested storage can be specified per StorageClass, the number of
PersistentVolumeClaims can also be limited per StorageClass.
Object count quotas can currently be set for the following objects:
■ Pods
■ ReplicationControllers
■ Secrets
■ ConfigMaps
■ PersistentVolumeClaims
■ Services (in general), and for two specific types of Services, such as LoadBalancer Services (services.loadbalancers) and NodePort Services (services.nodeports)
Finally, you can even set an object count quota for ResourceQuota objects themselves.
The number of other objects, such as ReplicaSets, Jobs, Deployments, Ingresses, and
so on, cannot be limited yet (but this may have changed since the book was published,
so please check the documentation for up-to-date information).
14.5.4 Specifying quotas for specific pod states and/or QoS classes
The quotas you’ve created so far have applied to all pods, regardless of their current
state and QoS class. But quotas can also be limited to a set of quota scopes. Four scopes are
currently available: BestEffort, NotBestEffort, Terminating, and NotTerminating.
The BestEffort and NotBestEffort scopes determine whether the quota applies
to pods with the BestEffort QoS class or with one of the other two classes (that is,
Burstable and Guaranteed).
The other two scopes (Terminating and NotTerminating) don’t apply to pods
that are (or aren’t) in the process of shutting down, as the name might lead you to
believe. We haven’t talked about this, but you can specify how long each pod is
allowed to run before it’s terminated and marked as Failed. This is done by setting
the activeDeadlineSeconds field in the pod spec. This property defines the number
of seconds a pod is allowed to be active on the node relative to its start time before it’s
marked as Failed and then terminated. The Terminating quota scope applies to pods
that have the activeDeadlineSeconds set, whereas the NotTerminating applies to
those that don’t.
When creating a ResourceQuota, you can specify the scopes that it applies to. A
pod must match all the specified scopes for the quota to apply to it. Additionally, what
a quota can limit depends on the quota’s scope. BestEffort scope can only limit the
number of pods, whereas the other three scopes can limit the number of pods,
CPU/memory requests, and CPU/memory limits.
If, for example, you want the quota to apply only to BestEffort, NotTerminating
pods, you can create the ResourceQuota object shown in the following listing.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: besteffort-notterminating-pods
spec:
  scopes:              # This quota only applies to pods that have the BestEffort
  - BestEffort         # QoS class and don't have an active deadline set.
  - NotTerminating
  hard:
    pods: 4            # Only four such pods can exist.
This quota ensures that at most four pods exist with the BestEffort QoS class,
which don’t have an active deadline. If the quota was targeting NotBestEffort pods
instead, you could also specify requests.cpu, requests.memory, limits.cpu, and
limits.memory.
NOTE Before you move on to the next section of this chapter, please delete
all the ResourceQuota and LimitRange resources you created. You won’t
need them anymore and they may interfere with examples in the following
chapters.
Figure 14.8 Metrics flow from the pods' containers, through each node's cAdvisor, to Heapster.
The arrows in the figure show how the metrics data flows. They don’t show which com-
ponent connects to which to get the data. The pods (or the containers running
therein) don’t know anything about cAdvisor, and cAdvisor doesn’t know anything
about Heapster. It’s Heapster that connects to all the cAdvisors, and it’s the cAdvisors
that collect the container and node usage data without having to talk to the processes
running inside the pods’ containers.
ENABLING HEAPSTER
If you’re running a cluster in Google Kubernetes Engine, Heapster is enabled by
default. If you’re using Minikube, it’s available as an add-on and can be enabled with
the following command:
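Assuming a Minikube version from around the time the book was written, the add-on is enabled like this:
$ minikube addons enable heapster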
To run Heapster manually in other types of Kubernetes clusters, you can refer to
instructions located at https://github.com/kubernetes/heapster.
After enabling Heapster, you’ll need to wait a few minutes for it to collect metrics
before you can see resource usage statistics for your cluster, so be patient.
DISPLAYING CPU AND MEMORY USAGE FOR CLUSTER NODES
Running Heapster in your cluster makes it possible to obtain resource usages for
nodes and individual pods through the kubectl top command. To see how much
CPU and memory is being used on your nodes, you can run the command shown in
the following listing.
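The command is kubectl top node; the numbers below are only representative:
$ kubectl top node
NAME       CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%
minikube   170m         8%        556Mi           27%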
This shows the actual, current CPU and memory usage of all the pods running on the
node, unlike the kubectl describe node command, which shows the amount of CPU
and memory requests and limits instead of actual runtime usage data.
DISPLAYING CPU AND MEMORY USAGE FOR INDIVIDUAL PODS
To see how much each individual pod is using, you can use the kubectl top pod com-
mand, as shown in the following listing.
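For example, to list usage for all pods across all namespaces (again, the pod names and values are only illustrative):
$ kubectl top pod --all-namespaces
NAMESPACE     NAME                          CPU(cores)   MEMORY(bytes)
kube-system   kube-addon-manager-minikube   7m           33Mi
default       kubia-manual                  0m           9Mi
...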
The outputs of both these commands are fairly simple, so you probably don’t need me
to explain them, but I do need to warn you about one thing. Sometimes the top pod command will refuse to show any metrics and instead prints an error saying the metrics aren't available yet.
If this happens, don’t start looking for the cause of the error yet. Relax, wait a while,
and rerun the command—it may take a few minutes, but the metrics should appear
eventually. The kubectl top command gets the metrics from Heapster, which aggre-
gates the data over a few minutes and doesn’t expose it immediately.
TIP To see resource usages across individual containers instead of pods, you
can use the --containers option.
Figure 14.9 Grafana dashboard showing CPU usage across the cluster
When using Minikube, Grafana’s web console is exposed through a NodePort Service,
so you can open it in your browser with the following command:
$ minikube service monitoring-grafana -n kube-system
Opening kubernetes service kube-system/monitoring-grafana in default
browser...
A new browser window or tab will open and show the Grafana Home screen. On the
right-hand side, you’ll see a list of dashboards containing two entries:
Cluster
Pods
To see the resource usage statistics of the nodes, open the Cluster dashboard. There
you’ll see several charts showing the overall cluster usage, usage by node, and the
individual usage for CPU, memory, network, and filesystem. The charts will not only
show the actual usage, but also the requests and limits for those resources (where
they apply).
If you then switch over to the Pods dashboard, you can examine the resource
usages for each individual pod, again with both requests and limits shown alongside
the actual usage.
Initially, the charts show the statistics for the last 30 minutes, but you can zoom out
and see the data for much longer time periods: days, months, or even years.
USING THE INFORMATION SHOWN IN THE CHARTS
By looking at the charts, you can quickly see if the resource requests or limits you’ve
set for your pods need to be raised or whether they can be lowered to allow more pods
to fit on your nodes. Let’s look at an example. Figure 14.10 shows the CPU and mem-
ory charts for a pod.
At the far right of the top chart, you can see the pod is using more CPU than was
requested in the pod’s manifest. Although this isn’t problematic when this is the only
pod running on the node, you should keep in mind that a pod is only guaranteed as
much of a resource as it requests through resource requests. Your pod may be running
fine now, but when other pods are deployed to the same node and start using the
CPU, your pod’s CPU time may be throttled. Because of this, to ensure the pod can
use as much CPU as it needs to at any time, you should raise the CPU resource request
for the pod’s container.
The bottom chart shows the pod’s memory usage and request. Here the situation is
the exact opposite. The amount of memory the pod is using is well below what was
requested in the pod’s spec. The requested memory is reserved for the pod and won’t
be available to other pods. The unused memory is therefore wasted. You should
decrease the pod’s memory request to make the memory available to other pods run-
ning on the node.
Figure 14.10 A pod's CPU and memory usage charts, showing actual usage alongside the configured requests
14.7 Summary
This chapter has shown you that you need to consider your pod’s resource usage and
configure both the resource requests and the limits for your pod to keep everything
running smoothly. The key takeaways from this chapter are
■ Specifying resource requests helps Kubernetes schedule pods across the cluster.
■ Specifying resource limits keeps pods from starving other pods of resources.
■ Unused CPU time is allocated based on containers' CPU requests.
■ Containers never get killed if they try to use too much CPU, but they are killed if they try to use too much memory.
■ In an overcommitted system, containers also get killed to free memory for more important pods, based on the pods' QoS classes and actual memory usage.
■ You can use LimitRange objects to define the minimum, maximum, and default resource requests and limits for individual pods.
■ You can use ResourceQuota objects to limit the amount of resources available to all the pods in a namespace.
■ To know how high to set a pod's resource requests and limits, you need to monitor how the pod uses resources over a long-enough time period.
In the next chapter, you’ll see how these metrics can be used by Kubernetes to auto-
matically scale your pods.
Automatic scaling
of pods and cluster nodes
Luckily, Kubernetes can monitor your pods and scale them up automatically as
soon as it detects an increase in the CPU usage or some other metric. If running on a
cloud infrastructure, it can even spin up additional nodes if the existing ones can’t
accept any more pods. This chapter will explain how to get Kubernetes to do both pod
and node autoscaling.
The autoscaling feature in Kubernetes was completely rewritten between the 1.6
and the 1.7 version, so be aware you may find outdated information on this subject
online.
This implies that Heapster must be running in the cluster for autoscaling to work. If
you’re using Minikube and were following along in the previous chapter, Heapster
should already be enabled in your cluster. If not, make sure to enable the Heapster
add-on before trying out any autoscaling examples.
Although you don’t need to query Heapster directly, if you’re interested in doing
so, you’ll find both the Heapster Pod and the Service it’s exposed through in the
kube-system namespace.
The core API server will not expose the metrics itself. From version 1.7, Kubernetes
allows registering multiple API servers and making them appear as a single API
server. This allows it to expose metrics through one of those underlying API servers.
We’ll explain API server aggregation in the last chapter.
Selecting what metrics collector to use in their clusters will be up to cluster adminis-
trators. A simple translation layer is usually required to expose the metrics in the
appropriate API paths and in the appropriate format.
Figure 15.3 The Horizontal Pod Autoscaler modifies only the Scale sub-resource.
This allows the Autoscaler to operate on any scalable resource, as long as the API
server exposes the Scale sub-resource for it. Currently, it’s exposed for
■ Deployments
■ ReplicaSets
■ ReplicationControllers
■ StatefulSets
These are currently the only objects you can attach an Autoscaler to.
Figure 15.4 How the autoscaler obtains metrics and rescales the target deployment
The arrows leading from the pods to the cAdvisors, which continue on to Heapster
and finally to the Horizontal Pod Autoscaler, indicate the direction of the flow of met-
rics data. It’s important to be aware that each component gets the metrics from the
other components periodically (that is, cAdvisor gets the metrics from the pods in a
continuous loop; the same is also true for Heapster and for the HPA controller). The
end effect is that it takes quite a while for the metrics data to be propagated and a res-
caling action to be performed. It isn’t immediate. Keep this in mind when you observe
the Autoscaler in action next.
For now, we're focusing only on scaling out (increasing the number of pods). By doing that, the average CPU usage should come down.
Because CPU usage is usually unstable, it makes sense to scale out even before the
CPU is completely swamped—perhaps when the average CPU load across the pods
reaches or exceeds 80%. But 80% of what, exactly?
TIP Always set the target CPU usage well below 100% (and definitely never
above 90%) to leave enough room for handling sudden load spikes.
As you may remember from the previous chapter, the process running inside a con-
tainer is guaranteed the amount of CPU requested through the resource requests
specified for the container. But at times when no other processes need CPU, the pro-
cess may use all the available CPU on the node. When someone says a pod is consum-
ing 80% of the CPU, it’s not clear if they mean 80% of the node’s CPU, 80% of the
pod’s guaranteed CPU (the resource request), or 80% of the hard limit configured
for the pod through resource limits.
As far as the Autoscaler is concerned, only the pod’s guaranteed CPU amount (the
CPU requests) is important when determining the CPU utilization of a pod. The Auto-
scaler compares the pod’s actual CPU consumption and its CPU requests, which
means the pods you’re autoscaling need to have CPU requests set (either directly or
indirectly through a LimitRange object) for the Autoscaler to determine the CPU uti-
lization percentage.
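If you'd rather not add the request to every pod manifest, a LimitRange object in the namespace can supply a default. The following is only a minimal sketch (the object name and the 100-millicore default are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-defaults
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m        # Default CPU request applied to containers that don't specify one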
CREATING A HORIZONTALPODAUTOSCALER BASED ON CPU USAGE
Let’s see how to create a HorizontalPodAutoscaler now and configure it to scale pods
based on their CPU utilization. You’ll create a Deployment similar to the one in chap-
ter 9, but as we’ve discussed, you’ll need to make sure the pods created by the Deploy-
ment all have the CPU resource requests specified in order to make autoscaling
possible. You’ll have to add a CPU resource request to the Deployment’s pod tem-
plate, as shown in the following listing.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kubia
spec:
  replicas: 3                  # Manually setting the (initial) desired number of replicas to three
  template:
    metadata:
      name: kubia
      labels:
        app: kubia
    spec:
      containers:
      - image: luksa/kubia:v1  # Running the kubia:v1 image
        name: nodejs
        resources:
          requests:
            cpu: 100m          # Requesting 100 millicores of CPU per pod
This is a regular Deployment object—it doesn’t use autoscaling yet. It will run three
instances of the kubia NodeJS app, with each instance requesting 100 millicores
of CPU.
After creating the Deployment, to enable horizontal autoscaling of its pods, you
need to create a HorizontalPodAutoscaler (HPA) object and point it to the Deploy-
ment. You could prepare and post the YAML manifest for the HPA, but an easier way
exists—using the kubectl autoscale command:
$ kubectl autoscale deployment kubia --cpu-percent=30 --min=1 --max=5
deployment "kubia" autoscaled
This creates the HPA object for you and sets the Deployment called kubia as the scal-
ing target. You’re setting the target CPU utilization of the pods to 30% and specifying
the minimum and maximum number of replicas. The Autoscaler will constantly keep
adjusting the number of replicas to keep their CPU utilization around 30%, but it will
never scale down to less than one or scale up to more than five replicas.
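If you prefer creating the HPA from a manifest instead, a roughly equivalent definition (written against the autoscaling/v2beta1 API; field names differ in other API versions) might look like this:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: kubia
spec:
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 30   # Keep the average CPU utilization of the pods at 30%
  scaleTargetRef:
    apiVersion: extensions/v1beta1   # The Deployment the Autoscaler will scale
    kind: Deployment
    name: kubia

When you retrieve the HPA object from the API server right after creating it, its status section looks like this: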
status:
  currentMetrics: []      # The current status of the Autoscaler
  currentReplicas: 3
  desiredReplicas: 0
Because you’re running three pods that are currently receiving no requests, which
means their CPU usage should be close to zero, you should expect the Autoscaler to
scale them down to a single pod, because even with a single pod, the CPU utilization
will still be below the 30% target.
And sure enough, the autoscaler does exactly that. It soon scales the Deployment
down to a single replica:
$ kubectl get deployment
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
kubia 1 1 1 1 23m
Remember, the autoscaler only adjusts the desired replica count on the Deployment.
The Deployment controller then takes care of updating the desired replica count on
the ReplicaSet object, which then causes the ReplicaSet controller to delete two excess
pods, leaving one pod running.
You can use kubectl describe to see more information on the HorizontalPod-
Autoscaler and the operation of the underlying controller, as the following listing shows.
Events:
  From                       Reason             Message
  ----                       ------             -------
  horizontal-pod-autoscaler  SuccessfulRescale  New size: 1; reason: All metrics below target
Turn your focus to the table of events at the bottom of the listing. You see the horizon-
tal pod autoscaler has successfully rescaled to one replica, because all metrics were
below target.
TRIGGERING A SCALE-UP
You’ve already witnessed your first automatic rescale event (a scale-down). Now, you’ll
start sending requests to your pod, thereby increasing its CPU usage, and you should
see the autoscaler detect this and start up additional pods.
You’ll need to expose the pods through a Service, so you can hit all of them through
a single URL. You may remember that the easiest way to do that is with kubectl expose:
$ kubectl expose deployment kubia --port=80 --target-port=8080
service "kubia" exposed
Before you start hitting your pod(s) with requests, you may want to run the following command in a separate terminal to keep an eye on what's happening with the HorizontalPodAutoscaler and the Deployment:

$ watch -n 1 kubectl get hpa,deployment

TIP List multiple resource types with kubectl get by delimiting them with a comma.
If you’re using OSX, you’ll have to replace the watch command with a loop, manually
run kubectl get periodically, or use kubectl’s --watch option. But although a plain
kubectl get can show multiple types of resources at once, that’s not the case when
using the aforementioned --watch option, so you’ll need to use two terminals if you
want to watch both the HPA and the Deployment objects.
Keep an eye on the state of those two objects while you run a load-generating pod.
You’ll run the following command in another terminal:
$ kubectl run -it --rm --restart=Never loadgenerator --image=busybox
➥ -- sh -c "while true; do wget -O - -q http://kubia.default; done"
This will run a pod which repeatedly hits the kubia Service. You’ve seen the -it
option a few times when running the kubectl exec command. As you can see, it can
also be used with kubectl run. It allows you to attach the console to the process,
which will not only show you the process’ output directly, but will also terminate the
process as soon as you press CTRL+C. The --rm option causes the pod to be deleted
afterward, and the --restart=Never option causes kubectl run to create an unman-
aged pod directly instead of through a Deployment object, which you don’t need.
This combination of options is useful for running commands inside the cluster with-
out having to piggyback on an existing pod. It not only behaves the same as if you
were running the command locally, it even cleans up everything when the command
terminates.
SEEING THE AUTOSCALER SCALE UP THE DEPLOYMENT
As the load-generator pod runs, you’ll see it initially hitting the single pod. As before,
it takes time for the metrics to be updated, but when they are, you’ll see the autoscaler
increase the number of replicas. In my case, the pod’s CPU utilization initially jumped
to 108%, which caused the autoscaler to increase the number of pods to four. The
utilization on the individual pods then decreased to 74% and then stabilized at
around 26%.
NOTE If the CPU load in your case doesn’t exceed 30%, try running addi-
tional load-generators.
Again, you can inspect autoscaler events with kubectl describe to see what the
autoscaler has done (only the most important information is shown in the following
listing).
Does it strike you as odd that the initial average CPU utilization in my case, when I
only had one pod, was 108%, which is more than 100%? Remember, a container’s
CPU utilization is the container’s actual CPU usage divided by its requested CPU. The
requested CPU defines the minimum, not maximum amount of CPU available to the
container, so a container may consume more than the requested CPU, bringing the
percentage over 100.
Before we go on, let’s do a little math and see how the autoscaler concluded that
four replicas are needed. Initially, there was one replica handling requests and its
CPU usage spiked to 108%. Dividing 108 by 30 (the target CPU utilization percent-
age) gives 3.6, which the autoscaler then rounded up to 4. If you divide 108 by 4, you
get 27%. If the autoscaler scales up to four pods, their average CPU utilization is
expected to be somewhere in the neighborhood of 27%, which is close to the target
value of 30% and almost exactly what the observed CPU utilization was.
UNDERSTANDING THE MAXIMUM RATE OF SCALING
In my case, the CPU usage shot up to 108%, but in general, the initial CPU usage
could spike even higher. Even if the initial average CPU utilization was higher (say
150%), requiring five replicas to achieve the 30% target, the autoscaler would still
only scale up to four pods in the first step, because it has a limit on how many repli-
cas can be added in a single scale-up operation. The autoscaler will at most double
the number of replicas in a single operation, if more than two current replicas
exist. If only one or two exist, it will scale up to a maximum of four replicas in a sin-
gle step.
Additionally, it has a limit on how soon a subsequent autoscale operation can
occur after the previous one. Currently, a scale-up will occur only if no rescaling
event occurred in the last three minutes. A scale-down event is performed even less
frequently—every five minutes. Keep this in mind so you don’t wonder why the
autoscaler refuses to perform a rescale operation even if the metrics clearly show
that it should.
MODIFYING THE TARGET METRIC VALUE ON AN EXISTING HPA OBJECT
To wrap up this section, let’s do one last exercise. Maybe your initial CPU utilization
target of 30% was a bit too low, so increase it to 60%. You do this by editing the HPA
resource with the kubectl edit command. When the text editor opens, change the
targetAverageUtilization field to 60, as shown in the following listing.
Listing 15.6 Increasing the target CPU utilization by editing the HPA resource
...
spec:
  maxReplicas: 5
  metrics:
  - resource:
      name: cpu
      targetAverageUtilization: 60    # Change this from 30 to 60.
    type: Resource
...
As with most other resources, after you modify the resource, your changes will be
detected by the autoscaler controller and acted upon. You could also delete the
resource and recreate it with different target values, because by deleting the HPA
resource, you only disable autoscaling of the target resource (a Deployment in this
case) and leave it at the scale it is at that time. The automatic scaling will resume after
you create a new HPA resource for the Deployment.
As you can see, the metrics field allows you to define more than one metric to use.
In the listing, you’re using a single metric. Each entry defines the type of metric—
in this case, a Resource metric. You have three types of metrics you can use in an
HPA object:
Resource
Pods
Object
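A Pods-type metric refers to a metric reported for every pod the HPA is targeting. The following is only a sketch of what such a definition might look like (written against the autoscaling/v2beta1 API; the qps metric name refers to an application-provided custom metric, and the target value is illustrative):

...
spec:
  metrics:
  - type: Pods
    pods:
      metricName: qps            # The average queries-per-second metric of the targeted pods
      targetAverageValue: 100    # The value the Autoscaler should keep the average at
...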
This example configures the autoscaler to keep the average QPS of all the pods managed by the ReplicaSet (or other) controller targeted by this HPA resource at 100.
UNDERSTANDING THE OBJECT METRIC TYPE
The Object metric type is used when you want to make the autoscaler scale pods
based on a metric that doesn’t pertain directly to those pods. For example, you may
want to scale pods according to a metric of another cluster object, such as an Ingress
object. The metric could be QPS as in listing 15.8, the average request latency, or
something else completely.
Unlike in the previous case, where the autoscaler needed to obtain the metric for
all targeted pods and then use the average of those values, when you use an Object
metric type, the autoscaler obtains a single metric from the single object. In the HPA
definition, you need to specify the target object and the target value. The following
listing shows an example.
...
spec:
  metrics:
  - type: Object                         # Use metric of a specific object
    object:
      metricName: latencyMillis          # The name of the metric
      target:
        apiVersion: extensions/v1beta1   # The specific object whose metric
        kind: Ingress                    # the autoscaler should obtain
        name: frontend
      targetValue: 20                    # The Autoscaler should scale so the value of the metric stays close to this.
  scaleTargetRef:
    apiVersion: extensions/v1beta1       # The scalable resource the
    kind: Deployment                     # autoscaler will scale
    name: kubia
...
In this example, the HPA is configured to use the latencyMillis metric of the
frontend Ingress object. The target value for the metric is 20. The horizontal pod
autoscaler will monitor the Ingress’ metric and if it rises too far above the target value,
the autoscaler will scale the kubia Deployment resource.
The Horizontal Pod Autoscaler currently doesn't allow setting minReplicas to zero, so it will never scale the number of pods down to zero, even if the pods aren't doing anything. Allowing the number of pods to be scaled down to zero can dramatically
increase the utilization of your hardware. When you run services that get requests only
once every few hours or even days, it doesn’t make sense to have them running all the
time, eating up resources that could be used by other pods. But you still want to have
those services available immediately when a client request comes in.
This is known as idling and un-idling. It allows pods that provide a certain service
to be scaled down to zero. When a new request comes in, the request is blocked until
the pod is brought up and then the request is finally forwarded to the pod.
Kubernetes currently doesn’t provide this feature yet, but it will eventually. Check
the documentation to see if idling has been implemented yet.
Figure 15.5 The Cluster Autoscaler scales up when it finds a pod that can’t be scheduled to
existing nodes.
When the new node starts up, the Kubelet on that node contacts the API server and
registers the node by creating a Node resource. From then on, the node is part of the
Kubernetes cluster and pods can be scheduled to it.
Simple, right? What about scaling down?
RELINQUISHING NODES
The Cluster Autoscaler also needs to scale down the number of nodes when they
aren’t being utilized enough. The Autoscaler does this by monitoring the requested
CPU and memory on all the nodes. If the CPU and memory requests of all the pods
running on a given node are below 50%, the node is considered unnecessary.
That’s not the only determining factor in deciding whether to bring a node down.
The Autoscaler also checks to see if any system pods are running (only) on that node
(apart from those that are run on every node, because they’re deployed by a Daemon-
Set, for example). If a system pod is running on a node, the node won’t be relinquished.
The same is also true if an unmanaged pod or a pod with local storage is running on the
node, because that would cause disruption to the service the pod is providing. In other
words, a node will only be returned to the cloud provider if the Cluster Autoscaler
knows the pods running on the node will be rescheduled to other nodes.
When a node is selected to be shut down, the node is first marked as unschedula-
ble and then all the pods running on the node are evicted. Because all those pods
belong to ReplicaSets or other controllers, their replacements are created and sched-
uled to the remaining nodes (that’s why the node that’s being shut down is first
marked as unschedulable).
How you start the Autoscaler depends on where your Kubernetes cluster is running.
For your kubia cluster running on GKE, you can enable the Cluster Autoscaler like
this:
$ gcloud container clusters update kubia --enable-autoscaling \
--min-nodes=3 --max-nodes=5
If your cluster is running on GCE, you need to set three environment variables before
running kube-up.sh:
KUBE_ENABLE_CLUSTER_AUTOSCALER=true
KUBE_AUTOSCALER_MIN_NODES=3
KUBE_AUTOSCALER_MAX_NODES=5
Certain services require that a minimum number of pods always keeps running;
this is especially true for quorum-based clustered applications. For this reason, Kuber-
netes provides a way of specifying the minimum number of pods that need to keep
running while performing these types of operations. This is done by creating a Pod-
DisruptionBudget resource.
Even though the name of the resource sounds complex, it’s one of the simplest
Kubernetes resources available. It contains only a pod label selector and a number
specifying the minimum number of pods that must always be available or, starting
from Kubernetes version 1.7, the maximum number of pods that can be unavailable.
We’ll look at what a PodDisruptionBudget (PDB) resource manifest looks like, but
instead of creating it from a YAML file, you’ll create it with kubectl create pod-
disruptionbudget and then obtain and examine the YAML later.
If you want to ensure three instances of your kubia pod are always running (they
have the label app=kubia), create the PodDisruptionBudget resource like this:
$ kubectl create pdb kubia-pdb --selector=app=kubia --min-available=3
poddisruptionbudget "kubia-pdb" created
Simple, right? Now, retrieve the PDB’s YAML. It’s shown in the next listing.
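The retrieved manifest should look roughly like this (a sketch of a policy/v1beta1 PodDisruptionBudget; the server-populated status section is omitted):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: kubia-pdb
spec:
  minAvailable: 3          # How many pods matching the selector
  selector:                # must always be available
    matchLabels:
      app: kubia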
You can also use a percentage instead of an absolute number in the minAvailable
field. For example, you could state that 60% of all pods with the app=kubia label need
to be running at all times.
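A sketch of what the spec would look like in that case:

spec:
  minAvailable: "60%"      # A percentage instead of an absolute number
  selector:
    matchLabels:
      app: kubia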
We don’t have much more to say about this resource. As long as it exists, both the
Cluster Autoscaler and the kubectl drain command will adhere to it and will never
evict a pod with the app=kubia label if that would bring the number of such pods
below three.
For example, if there were four pods altogether and minAvailable was set to three
as in the example, the pod eviction process would evict pods one by one, waiting for
the evicted pod to be replaced with a new one by the ReplicaSet controller, before
evicting another pod.
15.4 Summary
This chapter has shown you how Kubernetes can scale not only your pods, but also
your nodes. You’ve learned that
Configuring the automatic horizontal scaling of pods is as easy as creating a
HorizontalPodAutoscaler object and pointing it to a Deployment, ReplicaSet,
or ReplicationController and specifying the target CPU utilization for the pods.
Besides having the Horizontal Pod Autoscaler perform scaling operations based
on the pods’ CPU utilization, you can also configure it to scale based on your
own application-provided custom metrics or metrics related to other objects
deployed in the cluster.
Vertical pod autoscaling isn’t possible yet.
Even cluster nodes can be scaled automatically if your Kubernetes cluster runs
on a supported cloud provider.
You can run one-off processes in a pod and have the pod stopped and deleted automatically as soon as you press CTRL+C by using kubectl run with the -it and --rm options.
In the next chapter, you’ll explore advanced scheduling features, such as how to keep
certain pods away from certain nodes and how to schedule pods either close together
or apart.
Advanced scheduling
Kubernetes allows you to affect where pods are scheduled. Initially, this was only
done by specifying a node selector in the pod specification, but additional mech-
anisms were later added that expanded this functionality. They’re covered in this
chapter.
Taints and tolerations are used for restricting which pods can use a certain node. A pod can only be scheduled to a node if it tolerates the node's taints.
This is somewhat different from using node selectors and node affinity, which
you’ll learn about later in this chapter. Node selectors and node affinity rules make
it possible to select which nodes a pod can or can’t be scheduled to by specifically
adding that information to the pod, whereas taints allow rejecting deployment of
pods to certain nodes by only adding taints to the node without having to modify
existing pods. Pods that you want deployed on a tainted node need to opt in to use
the node, whereas with node selectors, pods explicitly specify which node(s) they
want to be deployed to.
Listing 16.1 Describing the master node in a cluster created with kubeadm
The master node has a single taint. Taints have a key, value, and an effect, and are repre-
sented as <key>=<value>:<effect>. The master node’s taint shown in the previous
listing has the key node-role.kubernetes.io/master, a null value (not shown in the
taint), and the effect of NoSchedule.
This taint prevents pods from being scheduled to the master node, unless those pods
tolerate this taint. The pods that tolerate it are usually system pods (see figure 16.1).
Figure 16.1 A pod is only scheduled to a node if it tolerates the node's taints.
As you can see, the first toleration matches the master node’s taint, allowing this kube-
proxy pod to be scheduled to the master node.
NOTE Disregard the equal sign, which is shown in the pod’s tolerations, but
not in the node’s taints. Kubectl apparently displays taints and tolerations dif-
ferently when the taint’s/toleration’s value is null.
UNDERSTANDING TAINT EFFECTS
The two other tolerations on the kube-proxy pod define how long the pod is allowed
to run on nodes that aren’t ready or are unreachable (the time in seconds isn’t shown,
but can be seen in the pod’s YAML). Those two tolerations refer to the NoExecute
instead of the NoSchedule effect.
Each taint has an effect associated with it. Three possible effects exist:
NoSchedule, which means pods won’t be scheduled to the node if they don’t tol-
erate the taint.
PreferNoSchedule is a soft version of NoSchedule, meaning the scheduler will
try to avoid scheduling the pod to the node, but will schedule it to the node if it
can’t schedule it somewhere else.
NoExecute, unlike NoSchedule and PreferNoSchedule that only affect schedul-
ing, also affects pods already running on the node. If you add a NoExecute taint
to a node, pods that are already running on that node and don’t tolerate the
NoExecute taint will be evicted from the node.
To taint a node, you use the kubectl taint command:

$ kubectl taint node node1.k8s node-type=production:NoSchedule

This adds a taint with key node-type, value production and the NoSchedule effect. If you now deploy multiple replicas of a regular pod, you'll see none of them are scheduled to the node you tainted.
Now, no one can inadvertently deploy pods onto the production nodes.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prod
spec:
  replicas: 5
  template:
    spec:
      ...
      tolerations:
      - key: node-type             # This toleration allows the pod to be
        operator: Equal            # scheduled to production nodes.
        value: production
        effect: NoSchedule
If you deploy this Deployment, you’ll see its pods get deployed to the production
node, as shown in the next listing.
Listing 16.5 Pods with the toleration are deployed on production node1
As you can see in the listing, production pods were also deployed to node2, which isn’t
a production node. To prevent that from happening, you’d also need to taint the non-
production nodes with a taint such as node-type=non-production:NoSchedule. Then
you’d also need to add the matching toleration to all your non-production pods.
You can use taints and tolerations in a similar way when several of your nodes provide special hardware and only part of your pods need to use it.
CONFIGURING HOW LONG AFTER A NODE FAILURE A POD IS RESCHEDULED
You can also use a toleration to specify how long Kubernetes should wait before
rescheduling a pod to another node if the node the pod is running on becomes
unready or unreachable. If you look at the tolerations of one of your pods, you’ll see
two tolerations, which are shown in the following listing.
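They look roughly like this (a sketch; the exact taint keys differ across Kubernetes versions; around version 1.7/1.8 they lived under the node.alpha.kubernetes.io prefix):

tolerations:
- effect: NoExecute
  key: node.alpha.kubernetes.io/notReady        # The pod tolerates the node being
  operator: Exists                              # not ready for 300 seconds.
  tolerationSeconds: 300
- effect: NoExecute
  key: node.alpha.kubernetes.io/unreachable     # The same for the node
  operator: Exists                              # being unreachable.
  tolerationSeconds: 300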
These two tolerations say that this pod tolerates a node being notReady or unreach-
able for 300 seconds. The Kubernetes Control Plane, when it detects that a node is no
longer ready or no longer reachable, will wait for 300 seconds before it deletes the
pod and reschedules it to another node.
These two tolerations are automatically added to pods that don’t define them. If
that five-minute delay is too long for your pods, you can make the delay shorter by
adding those two tolerations to the pod’s spec.
Node selectors will eventually be deprecated, so it’s important you understand the
new node affinity rules.
Similar to node selectors, each pod can define its own node affinity rules. These
allow you to specify either hard requirements or preferences. By specifying a prefer-
ence, you tell Kubernetes which nodes you prefer for a specific pod, and Kubernetes
will try to schedule the pod to one of those nodes. If that’s not possible, it will choose
one of the other nodes.
EXAMINING THE DEFAULT NODE LABELS
Node affinity selects nodes based on their labels, the same way node selectors do.
Before you see how to use node affinity, let’s examine the labels of one of the nodes in
a Google Kubernetes Engine cluster (GKE) to see what the default node labels are.
They’re shown in the following listing.
The node has many labels, but the last three are the most important when it comes to
node affinity and pod affinity, which you’ll learn about later. The meaning of those
three labels is as follows:
failure-domain.beta.kubernetes.io/region specifies the geographical region
the node is located in.
failure-domain.beta.kubernetes.io/zone specifies the availability zone the
node is in.
kubernetes.io/hostname is obviously the node’s hostname.
These and other labels can be used in pod affinity rules. In chapter 3, you already
learned how you can add a custom label to nodes and use it in a pod’s node selector.
You used the custom label to deploy pods only to nodes with that label by adding a node
selector to the pods. Now, you’ll see how to do the same using node affinity rules.
apiVersion: v1
kind: Pod
metadata:
  name: kubia-gpu
spec:
  nodeSelector:          # This pod is only scheduled to nodes
    gpu: "true"          # that have the gpu=true label.
...
The nodeSelector field specifies that the pod should only be deployed on nodes that
include the gpu=true label. If you replace the node selector with a node affinity rule,
the pod definition will look like the following listing.
apiVersion: v1
kind: Pod
metadata:
  name: kubia-gpu
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - "true"
The first thing you’ll notice is that this is much more complicated than a simple node
selector. But that’s because it’s much more expressive. Let’s examine the rule in detail.
MAKING SENSE OF THE LONG NODEAFFINITY ATTRIBUTE NAME
As you can see, the pod’s spec section contains an affinity field that contains a node-
Affinity field, which contains a field with an extremely long name, so let’s focus on
that first.
Let’s break it down into two parts and examine what they mean:
requiredDuringScheduling... means the rules defined under this field spec-
ify the labels the node must have for the pod to be scheduled to the node.
...IgnoredDuringExecution means the rules defined under the field don’t
affect pods already executing on the node.
At this point, let me make things easier for you by letting you know that affinity cur-
rently only affects pod scheduling and never causes a pod to be evicted from a node.
That’s why all the rules right now always end with IgnoredDuringExecution. Eventu-
ally, Kubernetes will also support RequiredDuringExecution, which means that if you
remove a label from a node, pods that require the node to have that label will be
evicted from such a node. As I’ve said, that’s not yet supported in Kubernetes, so let’s
not concern ourselves with the second part of that long field any longer.
UNDERSTANDING NODESELECTORTERMS
By keeping what was explained in the previous section in mind, it’s easy to understand
that the nodeSelectorTerms field and the matchExpressions field define which
expressions the node’s labels must match for the pod to be scheduled to the node.
The single expression in the example is simple to understand. The node must have a
gpu label whose value is set to true.
This pod will therefore only be scheduled to nodes that have the gpu=true label, as
shown in figure 16.2.
Figure 16.2 A pod's node affinity specifies which labels a node must have for the pod to be scheduled to it.
Now comes the more interesting part. Node affinity also allows you to prioritize nodes during scheduling. We'll look at that next.
Imagine you want your pods scheduled to a certain availability zone and to the machines reserved for your company's deployments. If those machines don't have enough room for the pods or if other important reasons exist that prevent them from being scheduled there, you're okay with them being scheduled to the machines your partners use and to the other zones. Node affinity allows you to do that.
LABELING NODES
First, the nodes need to be labeled appropriately. Each node needs to have a label that
designates the availability zone the node belongs to and a label marking it as either a
dedicated or a shared node.
Appendix B explains how to set up a three-node cluster (one master and two
worker nodes) in VMs running locally. In the following examples, I’ll use the two worker
nodes in that cluster, but you can also use Google Kubernetes Engine or any other
multi-node cluster.
NOTE Minikube isn’t the best choice for running these examples, because it
runs only one node.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: pref
spec:
  template:
    ...
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:  # You're specifying preferences, not hard requirements.
          - weight: 80
            preference:
              matchExpressions:              # You prefer the pod to be scheduled to zone1.
              - key: availability-zone       # This is your most important preference.
                operator: In
                values:
                - zone1
          - weight: 20
            preference:                      # You also prefer that your pods be scheduled to
              matchExpressions:              # dedicated nodes, but this is four times less
              - key: share-type              # important than your zone preference.
                operator: In
                values:
                - dedicated
      ...
Let’s examine the listing closely. You’re defining a node affinity preference, instead of
a hard requirement. You want the pods scheduled to nodes that include the labels
availability-zone=zone1 and share-type=dedicated. You’re saying that the first
preference rule is important by setting its weight to 80, whereas the second one is
much less important (weight is set to 20).
UNDERSTANDING HOW NODE PREFERENCES WORK
If your cluster had many nodes, when scheduling the pods of the Deployment in the
previous listing, the nodes would be split into four groups, as shown in figure 16.3.
Nodes whose availability-zone and share-type labels match the pod’s node affin-
ity are ranked the highest. Then, because of how the weights in the pod’s node affinity
rules are configured, next come the shared nodes in zone1, then come the dedicated
nodes in the other zones, and at the lowest priority are all the other nodes.
Out of the five pods that were created, four of them landed on node1 and only one
landed on node2. Why did one of them land on node2 instead of node1? The reason is
that besides the node affinity prioritization function, the Scheduler also uses other pri-
oritization functions to decide where to schedule a pod. One of those is the Selector-
SpreadPriority function, which makes sure pods belonging to the same ReplicaSet or
Service are spread around different nodes so a node failure won’t bring the whole ser-
vice down. That’s most likely what caused one of the pods to be scheduled to node2.
You can try scaling the Deployment up to 20 or more and you’ll see the majority of
pods will be scheduled to node1. In my test, only two out of the 20 were scheduled to
node2. If you hadn’t defined any node affinity preferences, the pods would have been
spread around the two nodes evenly.
This Deployment is not special in any way. The only thing you need to note is the
app=backend label you added to the pod using the -l option. This label is what you’ll
use in the frontend pod’s podAffinity configuration.
SPECIFYING POD AFFINITY IN A POD DEFINITION
The frontend pod’s definition is shown in the following listing.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 5
  template:
    ...
    spec:
      affinity:
        podAffinity:                                        # Defining podAffinity rules
          requiredDuringSchedulingIgnoredDuringExecution:   # Defining a hard requirement, not a preference
          - topologyKey: kubernetes.io/hostname             # The pods of this Deployment must be deployed on the
            labelSelector:                                  # same node as the pods that match the selector.
              matchLabels:
                app: backend
      ...
The listing shows that this Deployment will create pods that have a hard requirement
to be deployed on the same node (specified by the topologyKey field) as pods that
have the app=backend label (see figure 16.4).
Figure 16.4 Pod affinity allows scheduling pods to the node where other pods with a specific label are.
NOTE Instead of the simpler matchLabels field, you could also use the more
expressive matchExpressions field.
DEPLOYING A POD WITH POD AFFINITY
Before you create this Deployment, let’s see which node the backend pod was sched-
uled to earlier:
$ kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE
backend-257820-qhqj6 1/1 Running 0 8m 10.47.0.1 node2.k8s
When you create the frontend pods, they should be deployed to node2 as well. You’re
going to create the Deployment and see where the pods are deployed. This is shown
in the next listing.
Listing 16.14 Deploying frontend pods and seeing which node they’re scheduled to
All the frontend pods were indeed scheduled to the same node as the backend pod.
When scheduling the frontend pod, the Scheduler first found all the pods that match
the labelSelector defined in the frontend pod’s podAffinity configuration and
then scheduled the frontend pod to the same node.
UNDERSTANDING HOW THE SCHEDULER USES POD AFFINITY RULES
What’s interesting is that if you now delete the backend pod, the Scheduler will sched-
ule the pod to node2 even though it doesn’t define any pod affinity rules itself (the
rules are only on the frontend pods). This makes sense, because otherwise if the back-
end pod were to be deleted by accident and rescheduled to a different node, the fron-
tend pods’ affinity rules would be broken.
You can confirm the Scheduler takes other pods’ pod affinity rules into account, if
you increase the Scheduler’s logging level and then check its log. The following listing
shows the relevant log lines.
Listing 16.15 Scheduler log showing why the backend pod is scheduled to node2
If you focus on the two lines in bold, you’ll see that during the scheduling of the back-
end pod, node2 received a higher score than node1 because of inter-pod affinity.
Figure 16.5 The topologyKey in podAffinity determines the scope of where the pod should be scheduled to.
When scheduling a pod, the Scheduler first finds the pods that match the podAffinity rule's labelSelector, looks up the nodes those pods run on, and then only considers nodes whose topologyKey label value matches the value it found on those nodes. In figure 16.5, the label selector matched the backend pod, which runs on Node 12. The value of the rack label on that node equals rack2, so when scheduling a frontend pod, the Scheduler will only select among the nodes that have the rack=rack2 label.
NOTE By default, the label selector only matches pods in the same name-
space as the pod that’s being scheduled. But you can also select pods from
other namespaces by adding a namespaces field at the same level as label-
Selector.
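A rough sketch of what that might look like (the backend-namespace name is hypothetical):

podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - topologyKey: rack
    namespaces:                   # Match pods in this namespace instead of the
    - backend-namespace           # namespace of the pod that's being scheduled.
    labelSelector:
      matchLabels:
        app: backend

Pod affinity can also express a preference instead of a hard requirement, as the next example shows.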
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 5
  template:
    ...
    spec:
      affinity:
        podAffinity:                                          # Preferred instead of Required
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            podAffinityTerm:                                  # A weight and a podAffinityTerm are
              topologyKey: kubernetes.io/hostname             # specified as in the previous example
              labelSelector:
                matchLabels:
                  app: backend
      containers: ...
As in nodeAffinity preference rules, you need to define a weight for each rule. You
also need to specify the topologyKey and labelSelector, as in the hard-requirement
podAffinity rules. Figure 16.6 shows this scenario.
Figure 16.6 Pod affinity can be used to make the Scheduler prefer nodes where pods with a certain label are running.
Deploying this pod, as with your nodeAffinity example, deploys four pods on the same
node as the backend pod, and one pod on the other node (see the following listing).
16.3.4 Scheduling pods away from each other with pod anti-affinity
You’ve seen how to tell the Scheduler to co-locate pods, but sometimes you may want
the exact opposite. You may want to keep pods away from each other. This is called
pod anti-affinity. It’s specified the same way as pod affinity, except that you use the
podAntiAffinity property instead of podAffinity, which results in the Scheduler
never choosing nodes where pods matching the podAntiAffinity’s label selector are
running, as shown in figure 16.7.
Figure 16.7 Using pod anti-affinity to keep pods away from nodes that run pods with a certain label.
An example of why you’d want to use pod anti-affinity is when two sets of pods inter-
fere with each other’s performance if they run on the same node. In that case, you
want to tell the Scheduler to never schedule those pods on the same node. Another
example would be to force the Scheduler to spread pods of the same group across dif-
ferent availability zones or regions, so that a failure of a whole zone (or region) never
brings the service down completely.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 5
  template:
    metadata:
      labels:
        app: frontend                                       # The frontend pods have the app=frontend label.
    spec:
      affinity:
        podAntiAffinity:                                    # Defining hard requirements for pod anti-affinity
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname             # A frontend pod must not be scheduled to the same
            labelSelector:                                  # machine as a pod with the app=frontend label.
              matchLabels:
                app: frontend
      containers: ...
This time, you’re defining podAntiAffinity instead of podAffinity, and you’re mak-
ing the labelSelector match the same pods that the Deployment creates. Let’s see
what happens when you create this Deployment. The pods created by it are shown in
the following listing.
As you can see, only two pods were scheduled—one to node1, the other to node2. The
three remaining pods are all Pending, because the Scheduler isn’t allowed to schedule
them to the same nodes.
USING PREFERENTIAL POD ANTI-AFFINITY
In this case, you probably should have specified a soft requirement instead (using the
preferredDuringSchedulingIgnoredDuringExecution property). After all, it’s not
such a big problem if two frontend pods run on the same node. But in scenarios where
that’s a problem, using requiredDuringScheduling is appropriate.
As with pod affinity, the topologyKey property determines the scope of where the
pod shouldn’t be deployed to. You can use it to ensure pods aren’t deployed to the
same rack, availability zone, region, or any custom scope you create using custom
node labels.
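For example, a sketch of an anti-affinity rule that keeps the frontend pods out of the same availability zone, instead of just off the same node, might look like this (it uses the default zone label mentioned earlier in this chapter):

podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - topologyKey: failure-domain.beta.kubernetes.io/zone   # Two pods with the app=frontend label
    labelSelector:                                         # will never run in the same zone.
      matchLabels:
        app: frontend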
16.4 Summary
In this chapter, we looked at how to ensure pods aren’t scheduled to certain nodes or
are only scheduled to specific nodes, either because of the node’s labels or because of
the pods running on them.
You learned that
If you add a taint to a node, pods won’t be scheduled to that node unless they
tolerate that taint.
Three types of taints exist: NoSchedule completely prevents scheduling, Prefer-
NoSchedule isn’t as strict, and NoExecute even evicts existing pods from a node.
The NoExecute taint is also used to specify how long the Control Plane should
wait before rescheduling the pod when the node it runs on becomes unreach-
able or unready.
Node affinity allows you to specify which nodes a pod should be scheduled to. It
can be used to specify a hard requirement or to only express a node preference.
Pod affinity is used to make the Scheduler deploy pods to the same node where
another pod is running (based on the pod’s labels).
Pod affinity’s topologyKey specifies how close the pod should be deployed to
the other pod (onto the same node or onto a node in the same rack, availability
zone, or availability region).
Pod anti-affinity can be used to keep certain pods away from each other.
Both pod affinity and anti-affinity, like node affinity, can either specify hard
requirements or preferences.
In the next chapter, you’ll learn about best practices for developing apps and how to
make them run smoothly in a Kubernetes environment.
Best practices
for developing apps
We’ve now covered most of what you need to know to run your apps in Kubernetes.
We’ve explored what each individual resource does and how it’s used. Now we’ll see
how to combine them in a typical application running on Kubernetes. We’ll also
look at how to make an application run smoothly. After all, that’s the whole point
of using Kubernetes, isn’t it?
Hopefully, this chapter will help to clear up any misunderstandings and explain
things that weren’t explained clearly yet. Along the way, we’ll also introduce a few
additional concepts that haven’t been mentioned up to this point.
Figure 17.1 Resources in a typical application: some are defined in the app manifest by the developer, some are created by a cluster admin beforehand, and some are created automatically at runtime.
The application also contains one or more ConfigMaps, which are either used to
initialize environment variables or mounted as a configMap volume in the pod. Cer-
tain pods use additional volumes, such as an emptyDir or a gitRepo volume, whereas
pods requiring persistent storage use persistentVolumeClaim volumes. The Persistent-
VolumeClaims are also part of the application manifest, whereas StorageClasses refer-
enced by them are created by system administrators upfront.
In certain cases, an application also requires the use of Jobs or CronJobs. Daemon-
Sets aren’t normally part of application deployments, but are usually created by sysad-
mins to run system services on all or a subset of nodes. HorizontalPodAutoscalers
are either included in the manifest by the developers or added to the system later by
the ops team. The cluster administrator also creates LimitRange and ResourceQuota
objects to keep compute resource usage of individual pods and all the pods (as a
whole) under control.
After the application is deployed, additional objects are created automatically by
the various Kubernetes controllers. These include service Endpoints objects created
by the Endpoints controller, ReplicaSets created by the Deployment controller, and
the actual pods created by the ReplicaSet (or Job, CronJob, StatefulSet, or DaemonSet)
controllers.
Resources are often labeled with one or more labels to keep them organized. This
doesn’t apply only to pods but to all other resources as well. In addition to labels, most
resources also contain annotations that describe each resource, list the contact infor-
mation of the person or team responsible for it, or provide additional metadata for
management and other tools.
At the center of all this is the Pod, which arguably is the most important Kuberne-
tes resource. After all, each of your applications runs inside it. To make sure you know
how to develop apps that make the most out of their environment, let’s take one last
close look at pods—this time from the application’s perspective.
Kubernetes moves apps around automatically: it kills them, runs them elsewhere, reconfigures them, and makes sure they still run properly after the move. This means application developers need to make sure their apps allow being moved relatively often.
EXPECTING THE LOCAL IP AND HOSTNAME TO CHANGE
When a pod is killed and run elsewhere (technically, it’s a new pod instance replac-
ing the old one; the pod isn’t relocated), it not only has a new IP address but also a
new name and hostname. Most stateless apps can usually handle this without any
adverse effects, but stateful apps usually can’t. We’ve learned that stateful apps can
be run through a StatefulSet, which ensures that when the app starts up on a new
node after being rescheduled, it will still see the same host name and persistent state
as before. The pod’s IP will change nevertheless. Apps need to be prepared for that
to happen. The application developer therefore should never base membership in a
clustered app on the member’s IP address, and if basing it on the hostname, should
always use a StatefulSet.
EXPECTING THE DATA WRITTEN TO DISK TO DISAPPEAR
Another thing to keep in mind is that if the app writes data to disk, that data may not be
available after the app is started inside a new pod, unless you mount persistent storage at
the location the app is writing to. It should be clear this happens when the pod is
rescheduled, but files written to disk will disappear even in scenarios that don’t involve
any rescheduling. Even during the lifetime of a single pod, the files written to disk by
the app running in the pod may disappear. Let me explain this with an example.
Imagine an app that has a long and computationally intensive initial startup proce-
dure. To help the app come up faster on subsequent startups, the developers make
the app cache the results of the initial startup on disk (an example of this would be
the scanning of all Java classes for annotations at startup and then writing the results
to an index file). Because apps in Kubernetes run in containers by default, these files
are written to the container’s filesystem. If the container is then restarted, they’re all
lost, because the new container starts off with a completely new writable layer (see fig-
ure 17.2).
Don’t forget that individual containers may be restarted for several reasons, such
as because the process crashes, because the liveness probe returned a failure, or
because the node started running out of memory and the process was killed by the
OOMKiller. When this happens, the pod is still the same, but the container itself is
completely new. The Kubelet doesn’t run the same container again; it always creates a
new container.
USING VOLUMES TO PRESERVE DATA ACROSS CONTAINER RESTARTS
When its container is restarted, the app in the example will need to perform the
intensive startup procedure again. This may or may not be desired. To make sure data
like this isn’t lost, you need to use at least a pod-scoped volume. Because volumes live
and die together with the pod, the new container will be able to reuse the data written
to the volume by the previous container (figure 17.3).
Figure 17.2 Files written to the container’s filesystem are lost when the container is restarted.
Figure 17.3 Using a volume to preserve files across container restarts
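A minimal sketch of such a pod (the image name and the mount path are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-cache-volume
spec:
  containers:
  - image: some/app                # A hypothetical app that caches its startup results on disk
    name: app
    volumeMounts:
    - name: cache                  # The volume outlives individual containers, so a restarted
      mountPath: /var/cache/app    # container finds the files its predecessor wrote there.
  volumes:
  - name: cache
    emptyDir: {}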
Using a volume to preserve files across container restarts is a great idea sometimes,
but not always. What if the data gets corrupted and causes the newly created process
to crash again? This will result in a continuous crash loop (the pod will show the
CrashLoopBackOff status). If you hadn’t used a volume, the new container would start
from scratch and most likely not crash. Using volumes to preserve files across con-
tainer restarts like this is a double-edged sword. You need to think carefully about
whether to use them or not.
You’d probably expect the pod to be deleted and replaced with another pod instance
that might run successfully on another node. After all, the container may be crashing
because of a node-related problem that doesn’t manifest itself on other nodes. Sadly,
that isn’t the case. The ReplicaSet controller doesn’t care if the pods are dead—all it
cares about is that the number of pods matches the desired replica count, which in
this case, it does.
If you’d like to see for yourself, I’ve included a YAML manifest for a ReplicaSet
whose pods will keep crashing (see file replicaset-crashingpods.yaml in the code
archive). If you create the ReplicaSet and inspect the pods that are created, the follow-
ing listing is what you’ll see.
In a way, it’s understandable that Kubernetes behaves this way. The container will be
restarted every five minutes in the hope that the underlying cause of the crash will be
resolved. The rationale is that rescheduling the pod to another node most likely
wouldn’t fix the problem anyway, because the app is running inside a container and
all the nodes should be mostly equivalent. That’s not always the case, but it is most of
the time.
A whole system is usually defined in a single YAML or JSON file containing multiple Pods, Services, and other objects.
The Kubernetes API server does process the objects in the YAML/JSON in the
order they’re listed, but this only means they’re written to etcd in that order. You have
no guarantee that pods will also be started in that order.
But you can prevent a pod's main container from starting until a precondition is met. This is done by including an init container in the pod.
INTRODUCING INIT CONTAINERS
In addition to regular containers, pods can also include init containers. As the name
suggests, they can be used to initialize the pod—this often means writing data to the
pod’s volumes, which are then mounted into the pod’s main container(s).
A pod may have any number of init containers. They’re executed sequentially and
only after the last one completes are the pod’s main containers started. This means
init containers can also be used to delay the start of the pod’s main container(s)—for
example, until a certain precondition is met. An init container could wait for a service
required by the pod’s main container to be up and ready. When it is, the init container
terminates and allows the main container(s) to be started. This way, the main con-
tainer wouldn’t use the service before it’s ready.
Let’s look at an example of a pod using an init container to delay the start of the
main container. Remember the fortune pod you created in chapter 7? It’s a web
server that returns a fortune quote as a response to client requests. Now, let’s imagine
you have a fortune-client pod that requires the fortune Service to be up and run-
ning before its main container starts. You can add an init container, which checks
whether the Service is responding to requests. Until that’s the case, the init container
keeps retrying. Once it gets a response, the init container terminates and lets the main
container start.
ADDING AN INIT CONTAINER TO A POD
Init containers can be defined in the pod spec like main containers but through the
spec.initContainers field. You’ll find the complete YAML for the fortune-client pod
in the book’s code archive. The following listing shows the part where the init con-
tainer is defined.
spec:
  initContainers:                  # You're defining an init container, not a regular container.
  - name: init
    image: busybox
    command:                       # The init container runs a loop that runs until the fortune Service is up.
    - sh
    - -c
    - 'while true; do echo "Waiting for fortune service to come up...";
      wget http://fortune -q -T 1 -O /dev/null >/dev/null 2>/dev/null
      && break; sleep 1; done; echo "Service is up! Starting main container."'
When you deploy this pod, only its init container is started. This is shown in the pod’s
status when you list pods with kubectl get:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
fortune-client 0/1 Init:0/1 0 1m
The STATUS column shows that zero of one init containers have finished. You can see
the log of the init container with kubectl logs:
$ kubectl logs fortune-client -c init
Waiting for fortune service to come up...
When running the kubectl logs command, you need to specify the name of the init
container with the -c switch (in the example, the name of the pod’s init container is
init, as you can see in listing 17.2).
The main container won’t run until you deploy the fortune Service and the
fortune-server pod. You’ll find them in the fortune-server.yaml file.
BEST PRACTICES FOR HANDLING INTER-POD DEPENDENCIES
You’ve seen how an init container can be used to delay starting the pod’s main con-
tainer(s) until a precondition is met (making sure the Service the pod depends on is
ready, for example), but it’s much better to write apps that don’t require every service
they rely on to be ready before the app starts up. After all, the service may also go
offline later, while the app is already running.
The application needs to handle internally the possibility that its dependencies
aren’t ready. And don’t forget readiness probes. If an app can’t do its job because one
of its dependencies is missing, it should signal that through its readiness probe, so
Kubernetes knows it, too, isn’t ready. You’ll want to do this not only because it pre-
vents the app from being added as a service endpoint, but also because the app’s read-
iness is also used by the Deployment controller when performing a rolling update,
thereby preventing a rollout of a bad version.
These lifecycle hooks are specified per container, unlike init containers, which apply
to the whole pod. As their names suggest, they’re executed when the container starts
and before it stops.
Lifecycle hooks are similar to liveness and readiness probes in that they can either
Execute a command inside the container
Perform an HTTP GET request against a URL
Let’s look at the two hooks individually to see what effect they have on the container
lifecycle.
USING A POST-START CONTAINER LIFECYCLE HOOK
A post-start hook is executed immediately after the container’s main process is started.
You use it to perform additional operations when the application starts. Sure, if you’re
the author of the application running in the container, you can always perform those
operations inside the application code itself. But when you’re running an application
developed by someone else, you mostly don’t want to (or can’t) modify its source
code. Post-start hooks allow you to run additional commands without having to touch
the app. These may signal to an external listener that the app is starting, or they may
initialize the application so it can start doing its job.
The hook is run in parallel with the main process. The name might be somewhat
misleading, because it doesn’t wait for the main process to start up fully (if the process
has an initialization procedure, the Kubelet obviously can’t wait for the procedure to
complete, because it has no way of knowing when that is).
But even though the hook runs asynchronously, it does affect the container in two
ways. Until the hook completes, the container will stay in the Waiting state with the
reason ContainerCreating. Because of this, the pod’s status will be Pending instead of
Running. If the hook fails to run or returns a non-zero exit code, the main container
will be killed.
A pod manifest containing a post-start hook looks like the following listing.
apiVersion: v1
kind: Pod
metadata:
name: pod-with-poststart-hook
spec:
containers:
- image: luksa/kubia
name: kubia
lifecycle: The hook is executed as
postStart: the container starts.
It executes the exec:
postStart.sh command:
script in the /bin - sh
directory inside - -c
the container. - "echo 'hook will fail with exit code 15'; sleep 5; exit 15"
In the example, the echo, sleep, and exit commands are executed along with the
container’s main process as soon as the container is created. Rather than run a com-
mand like this, you’d typically run a shell script or a binary executable file stored in
the container image.
Sadly, if the process started by the hook logs to the standard output, you can’t see
the output anywhere. This makes debugging lifecycle hooks painful. If the hook fails,
you’ll only see a FailedPostStartHook warning among the pod’s events (you can see
them using kubectl describe pod). A while later, you’ll see more information on why
the hook failed, as shown in the following listing.
Listing 17.4 Pod’s events showing the exit code of the failed command-based hook
The number 15 in the last line is the exit code of the command. When using an HTTP
GET hook handler, the reason may look like the following listing (you can try this by
deploying the post-start-hook-httpget.yaml file from the book’s code archive).
Listing 17.5 Pod’s events showing the reason why an HTTP GET hook failed
The standard and error outputs of command-based post-start hooks aren’t logged any-
where, so you may want to have the process the hook invokes log to a file in the con-
tainer’s filesystem, which will allow you to examine the contents of the file with
something like this:
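For example, if the hook writes its output to /tmp/hook.log (a hypothetical path), you could inspect it like this:

$ kubectl exec my-pod -- cat /tmp/hook.log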
If the container gets restarted for whatever reason (including because the hook failed),
the file may be gone before you can examine it. You can work around that by mount-
ing an emptyDir volume into the container and having the hook write to it.
USING A PRE-STOP CONTAINER LIFECYCLE HOOK
A pre-stop hook is executed immediately before a container is terminated. When a
container needs to be terminated, the Kubelet will run the pre-stop hook, if config-
ured, and only then send a SIGTERM to the process (and later kill the process if it
doesn’t terminate gracefully).
A pre-stop hook can be used to initiate a graceful shutdown of the container when it doesn’t shut down gracefully upon receiving a SIGTERM signal. Pre-stop hooks can also be used to perform arbitrary operations before shutdown without having to implement those operations in the application itself (this is useful when you’re running a third-party app whose source code you don’t have access to or can’t modify).
Configuring a pre-stop hook in a pod manifest isn’t very different from adding a
post-start hook. The previous example showed a post-start hook that executes a command, so we’ll look at a pre-stop hook that performs an HTTP GET request now. The
following listing shows how to define a pre-stop HTTP GET hook in a pod.
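The listing isn’t reproduced here, but a minimal version of such a hook in the container spec would look roughly like this (the port and path match the description that follows):

    lifecycle:
      preStop:
        httpGet:
          port: 8080
          path: /shutdown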
The pre-stop hook defined in this listing performs an HTTP GET request to http://
POD_IP:8080/shutdown as soon as the Kubelet starts terminating the container.
Apart from the port and path shown in the listing, you can also set the fields scheme
(HTTP or HTTPS) and host, as well as httpHeaders that should be sent in the
request. The host field defaults to the pod IP. Be sure not to set it to localhost,
because localhost would refer to the node, not the pod.
In contrast to the post-start hook, the container will be terminated regardless of
the result of the hook—an error HTTP response code or a non-zero exit code when
using a command-based hook will not prevent the container from being terminated.
If the pre-stop hook fails, you’ll see a FailedPreStopHook warning event among the
pod’s events, but because the pod is deleted soon afterward (after all, the pod’s dele-
tion is what triggered the pre-stop hook in the first place), you may not even notice
that the pre-stop hook failed to run properly.
TIP If the successful completion of the pre-stop hook is critical to the proper
operation of your system, verify whether it’s being executed at all. I’ve wit-
nessed situations where the pre-stop hook didn’t run and the developer
wasn’t even aware of that.
USING A PRE-STOP HOOK BECAUSE YOUR APP DOESN’T RECEIVE THE SIGTERM SIGNAL
Many developers make the mistake of defining a pre-stop hook solely to send a SIGTERM
signal to their apps in the pre-stop hook. They do this because they don’t see their appli-
cation receive the SIGTERM signal sent by the Kubelet. The reason why the signal isn’t
received by the application isn’t because Kubernetes isn’t sending it, but because the sig-
nal isn’t being passed to the app process inside the container itself. If your container
image is configured to run a shell, which in turn runs the app process, the signal may be
eaten up by the shell itself, instead of being passed down to the child process.
In such cases, instead of adding a pre-stop hook to send the signal directly to your
app, the proper fix is to make sure the shell passes the signal to the app. This can be
achieved by handling the signal in the shell script running as the main container pro-
cess and then passing it on to the app. Or you could not configure the container image
to run a shell at all and instead run the application binary directly. You do this by using
the exec form of ENTRYPOINT or CMD in the Dockerfile: ENTRYPOINT ["/mybinary"]
instead of ENTRYPOINT /mybinary.
A container using the first form runs the mybinary executable as its main process,
whereas the second form runs a shell as the main process with the mybinary process
executed as a child of the shell process.
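If you do need to keep a shell script as the container’s main process, a minimal sketch of such a signal-forwarding wrapper (using the /mybinary example from above) might look like this:

#!/bin/sh
# Start the app in the background and remember its PID.
/mybinary &
PID=$!
# When the shell receives SIGTERM, forward it to the app instead of swallowing it.
trap 'kill -TERM $PID' TERM
# Wait for the app to exit (wait again in case the first wait was interrupted by the trap).
wait $PID
wait $PID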
UNDERSTANDING THAT LIFECYCLE HOOKS TARGET CONTAINERS, NOT PODS
As a final thought on post-start and pre-stop hooks, let me emphasize that these lifecy-
cle hooks relate to containers, not pods. You shouldn’t use a pre-stop hook for run-
ning actions that need to be performed when the pod is terminating. The reason is
that the pre-stop hook gets called when the container is being terminated (most likely
because of a failed liveness probe). This may happen multiple times in the pod’s life-
time, not only when the pod is in the process of being shut down.
(Figure: the termination sequence. SIGTERM is sent to the process first; if it’s still running when the termination grace period expires, it’s killed with SIGKILL.)
TIP You should set the grace period long enough so your process can finish cleaning up in that time.
The grace period specified in the pod spec can also be overridden when deleting the
pod like this:
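The command isn’t shown here, but overriding the grace period when deleting a pod looks something like this (mypod is a hypothetical pod name):

$ kubectl delete po mypod --grace-period=5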
This will make the Kubelet wait five seconds for the pod to shut down cleanly. When
all the pod’s containers stop, the Kubelet notifies the API server and the Pod resource
is finally deleted. You can force the API server to delete the resource immediately,
without waiting for confirmation, by setting the grace period to zero and adding the
--force option like this:
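For example (again using a hypothetical pod name):

$ kubectl delete po mypod --grace-period=0 --force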
Be careful when using this option, especially with pods of a StatefulSet. The Stateful-
Set controller takes great care to never run two instances of the same pod at the same
time (two pods with the same ordinal index and name and attached to the same
PersistentVolume). By force-deleting a pod, you’ll cause the controller to create a
replacement pod without waiting for the containers of the deleted pod to shut
down. In other words, two instances of the same pod might be running at the same
time, which may cause your stateful cluster to malfunction. Only delete stateful pods
forcibly when you’re absolutely sure the pod isn’t running anymore or can’t talk to
the other members of the cluster (you can be sure of this when you confirm that the
node that hosted the pod has failed or has been disconnected from the network and
can’t reconnect).
Now that you understand how containers are shut down, let’s look at it from the
application’s perspective and go over how applications should handle the shutdown
procedure.
IMPLEMENTING THE PROPER SHUTDOWN HANDLER IN YOUR APPLICATION
Applications should react to a SIGTERM signal by starting their shut-down procedure
and terminating when it finishes. Instead of handling the SIGTERM signal, the applica-
tion can be notified to shut down through a pre-stop hook. In both cases, the app
then only has a fixed amount of time to terminate cleanly.
But what if you can’t predict how long the app will take to shut down cleanly? For
example, imagine your app is a distributed data store. On scale-down, one of the pod
instances will be deleted and therefore shut down. In the shut-down procedure, the
pod needs to migrate all its data to the remaining pods to make sure it’s not lost.
Should the pod start migrating the data upon receiving a termination signal (through
either the SIGTERM signal or through a pre-stop hook)?
Absolutely not! This is not recommended for at least the following two reasons:
■ A container terminating doesn’t necessarily mean the whole pod is being terminated.
■ You have no guarantee the shut-down procedure will finish before the process is killed.
This second scenario doesn’t happen only when the grace period runs out before the
application has finished shutting down gracefully, but also when the node running
the pod fails in the middle of the container shut-down sequence. Even if the node
then starts up again, the Kubelet will not restart the shut-down procedure (it won’t
even start up the container again). There are absolutely no guarantees that the pod
will be allowed to complete its whole shut-down procedure.
REPLACING CRITICAL SHUT-DOWN PROCEDURES WITH DEDICATED SHUT-DOWN PROCEDURE PODS
How do you ensure that a critical shut-down procedure that absolutely must run to
completion does run to completion (for example, to ensure that a pod’s data is
migrated to other pods)?
One solution is for the app (upon receipt of a termination signal) to create a new
Job resource that would run a new pod, whose sole job is to migrate the deleted pod’s
data to the remaining pods. But if you’ve been paying attention, you’ll know that you
have no guarantee the app will indeed manage to create the Job object every single
time. What if the node fails exactly when the app tries to do that?
The proper way to handle this problem is by having a dedicated, constantly run-
ning pod that keeps checking for the existence of orphaned data. When this pod finds
the orphaned data, it can migrate it to the remaining pods. Rather than a constantly
running pod, you can also use a CronJob resource and run the pod periodically.
You may think StatefulSets could help here, but they don’t. As you’ll remember,
scaling down a StatefulSet leaves PersistentVolumeClaims orphaned, leaving the data
stored on the PersistentVolume stranded. Yes, upon a subsequent scale-up, the Persistent-
Volume will be reattached to the new pod instance, but what if that scale-up never
happens (or happens after a long time)? For this reason, you may want to run a
data-migrating pod also when using StatefulSets (this scenario is shown in figure 17.6).
To prevent the migration from occurring during an application upgrade, the data-
migrating pod could be configured to wait a while to give the stateful pod time to
come up again before performing the migration.
(Figure 17.6: after StatefulSet A is scaled down from two replicas to one, a dedicated data-migrating pod, run through a Job, connects to the orphaned PersistentVolumeClaim and transfers its data to the remaining pod(s).)
cases that gets you far enough and saves you from having to implement a special read-
iness endpoint in your app.
(Figure: sequence B of events when a pod is deleted. B1: the Endpoints controller receives the pod deletion notification; B2: it removes the pod as an endpoint; B3: kube-proxy on each worker node receives the endpoint modification notification; B4: it removes the pod from the iptables rules.)
In the A sequence of events, you’ll see that as soon as the Kubelet receives the notifica-
tion that the pod should be terminated, it initiates the shutdown sequence as explained
in section 17.2.5 (run the pre-stop hook, send SIGTERM, wait for a period of time, and
then forcibly kill the container if it hasn’t yet terminated on its own). If the app
responds to the SIGTERM by immediately ceasing to receive client requests, any client
trying to connect to it will receive a Connection Refused error. The time it takes for
this to happen from the time the pod is deleted is relatively short because of the direct
path from the API server to the Kubelet.
Now, let’s look at what happens in the other sequence of events—the one leading
up to the pod being removed from the iptables rules (sequence B in the figure).
When the Endpoints controller (which runs in the Controller Manager in the Kuber-
netes Control Plane) receives the notification of the Pod being deleted, it removes
the pod as an endpoint in all services that the pod is a part of. It does this by modify-
ing the Endpoints API object by sending a REST request to the API server. The API
server then notifies all clients watching the Endpoints object. Among those watchers
are all the kube-proxies running on the worker nodes. Each of these proxies then
updates the iptables rules on its node, which is what prevents new connections
from being forwarded to the terminating pod. An important detail here is that
removing the iptables rules has no effect on existing connections—clients who are
already connected to the pod will still send additional requests to the pod through
those existing connections.
Both of these sequences of events happen in parallel. Most likely, the time it takes
to shut down the app’s process in the pod is slightly shorter than the time required for
the iptables rules to be updated. The chain of events that leads to iptables rules
being updated is considerably longer (see figure 17.8), because the event must first
reach the Endpoints controller, which then sends a new request to the API server, and
then the API server must notify the kube-proxy before the proxy finally modifies the
iptables rules. A high probability exists that the SIGTERM signal will be sent well
before the iptables rules are updated on all nodes.
The end result is that the pod may still receive client requests after it was sent the
termination signal. If the app closes the server socket and stops accepting connections
immediately, this will cause clients to receive “Connection Refused” types of errors
(similar to what happens at pod startup if your app isn’t capable of accepting connec-
tions immediately and you don’t define a readiness probe for it).
SOLVING THE PROBLEM
Googling solutions to this problem makes it seem as though adding a readiness probe
to your pod will solve the problem. Supposedly, all you need to do is make the readi-
ness probe start failing as soon as the pod receives the SIGTERM. This is supposed to
cause the pod to be removed as the endpoint of the service. But the removal would
happen only after the readiness probe fails for a few consecutive times (this is configu-
rable in the readiness probe spec). And, obviously, the removal then still needs to
reach the kube-proxy before the pod is removed from iptables rules.
In reality, the readiness probe has absolutely no bearing on the whole process at
all. The Endpoints controller removes the pod from the service Endpoints as soon as
it receives notice of the pod being deleted (when the deletionTimestamp field in the
pod’s metadata is no longer null). From that point on, the result of the readiness probe
is irrelevant.
What’s the proper solution to the problem? How can you make sure all requests
are handled fully?
It’s clear the pod needs to keep accepting connections even after it receives the ter-
mination signal up until all the kube-proxies have finished updating the iptables
rules. Well, it’s not only the kube-proxies. There may also be Ingress controllers or
load balancers forwarding connections to the pod directly, without going through the
Service (iptables). This also includes clients using client-side load-balancing. To
ensure none of the clients experience broken connections, you’d have to wait until all
of them somehow notify you they’ll no longer forward connections to the pod.
That’s impossible, because all those components are distributed across many dif-
ferent computers. Even if you knew the location of every one of them and could wait
until all of them say it’s okay to shut down the pod, what do you do if one of them
doesn’t respond? How long do you wait for the response? Remember, during that
time, you’re holding up the shut-down process.
The only reasonable thing you can do is wait for a long-enough time to ensure all
the proxies have done their job. But how long is long enough? A few seconds should
be enough in most situations, but there’s no guarantee it will suffice every time. When
the API server or the Endpoints controller is overloaded, it may take longer for the
notification to reach the kube-proxy. It’s important to understand that you can’t solve
the problem perfectly, but even adding a 5- or 10-second delay should improve the
user experience considerably. You can use a longer delay, but don’t go overboard,
because the delay will prevent the container from shutting down promptly and will
cause the pod to be shown in lists long after it has been deleted, which is always frus-
trating to the user deleting the pod.
WRAPPING UP THIS SECTION
To recap, properly shutting down an application includes these steps:
■ Wait for a few seconds, then stop accepting new connections.
■ Close all keep-alive connections not in the middle of a request.
■ Wait for all active requests to finish.
■ Then shut down completely.
To understand what’s happening with the connections and requests during this pro-
cess, examine figure 17.9 carefully.
Figure 17.9 Properly handling existing and new connections after receiving a termination signal
Not as simple as exiting the process immediately upon receiving the termination sig-
nal, right? Is it worth going through all this? That’s for you to decide. But the least you
can do is add a pre-stop hook that waits a few seconds, like the one in the following
listing, perhaps.
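The listing isn’t shown here, but a pre-stop hook that simply sleeps for a few seconds would look something like this in the container spec:

    lifecycle:
      preStop:
        exec:
          command:
          - sh
          - -c
          - "sleep 5"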
This way, you don’t need to modify the code of your app at all. If your app already
ensures all in-flight requests are processed completely, this pre-stop delay may be all
you need.
TIP To build truly minimal images, use the FROM scratch directive in the Dockerfile.
But in practice, you’ll soon see these minimal images are extremely difficult to debug.
The first time you need to run a tool such as ping, dig, curl, or something similar
inside the container, you’ll realize how important it is for container images to also
include at least a limited set of these tools. I can’t tell you what to include and what
not to include in your images, because it depends on how you do things, so you’ll
need to find the sweet spot yourself.
It’s almost mandatory to use tags containing a proper version designator instead
of latest, except maybe in development. Keep in mind that if you use mutable tags
(you push changes to the same tag), you’ll need to set the imagePullPolicy field in
the pod spec to Always. But if you use that in production pods, be aware of the big
caveat associated with it. If the image pull policy is set to Always, the container run-
time will contact the image registry every time a new pod is deployed. This slows
down pod startup a bit, because the node needs to check if the image has been mod-
ified. Worse yet, this policy prevents the pod from starting up when the registry can-
not be contacted.
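For example, a container spec using a versioned, immutable tag could look like this (the image name is hypothetical):

    containers:
    - name: main
      image: example.com/myapp:1.2.3     # a versioned tag instead of :latest
      imagePullPolicy: IfNotPresent      # fine for immutable tags; mutable tags would require Always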
This will allow you to manage resources in groups instead of individually and make it
easy to see where each resource belongs.
moment. Be nice to the ops people and make their lives easier by including all the
necessary debug information in your log files.
But to make triage even easier, you can use one other Kubernetes feature that
makes it possible to show the reason why a container terminated in the pod’s status.
You do this by having the process write a termination message to a specific file in the
container’s filesystem. The contents of this file are read by the Kubelet when the con-
tainer terminates and are shown in the output of kubectl describe pod. If an applica-
tion uses this mechanism, an operator can quickly see why the app terminated without
even having to look at the container logs.
The default file the process needs to write the message to is /dev/termination-log,
but it can be changed by setting the terminationMessagePath field in the container
definition in the pod spec.
You can see this in action by running a pod whose container dies immediately, as
shown in the following listing.
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-termination-message
spec:
  containers:
  - image: busybox
    name: main
    terminationMessagePath: /var/termination-reason    # you're overriding the default path of the termination message file
    command:                                            # the container will write the message to the file just before exiting
    - sh
    - -c
    - 'echo "I''ve had enough" > /var/termination-reason ; exit 1'
When running this pod, you’ll soon see the pod’s status shown as CrashLoopBackOff.
If you then use kubectl describe, you can see why the container died, without having
to dig down into its logs, as shown in the following listing.
Listing 17.9 Seeing the container’s termination message with kubectl describe
$ kubectl describe po
Name:            pod-with-termination-message
...
Containers:
  ...
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Message:      I've had enough      <-- the reason the container died, visible without having to inspect its logs
      Exit Code:    1
      Started:      Tue, 21 Feb 2017 21:38:31 +0100
      Finished:     Tue, 21 Feb 2017 21:38:31 +0100
    Ready:          False
    Restart Count:  6
As you can see, the “I’ve had enough” message the process wrote to the file /var/ter-
mination-reason is shown in the container’s Last State section. Note that this mecha-
nism isn’t limited only to containers that crash. It can also be used in pods that run a
completable task and terminate successfully (you’ll find an example in the file termi-
nation-message-success.yaml).
This mechanism is great for terminated containers, but you’ll probably agree that
a similar mechanism would also be useful for showing app-specific status messages of
running, not only terminated, containers. Kubernetes currently doesn’t provide any
such functionality and I’m not aware of any plans to introduce it.
NOTE If the container doesn’t write the message to any file, you can set the
terminationMessagePolicy field to FallbackToLogsOnError. In that case,
the last few lines of the container’s log are used as its termination message
(but only when the container terminates unsuccessfully).
TIP If a container crashes and is replaced with a new one, you’ll see the new
container’s log. To see the previous container’s logs, use the --previous
option with kubectl logs.
If the application logs to a file instead of the standard output, you can display the log
file using an alternative approach:
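For example, assuming the app writes its log to /var/log/myapp.log (a hypothetical path and pod name):

$ kubectl exec mypod -- cat /var/log/myapp.log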
This executes the cat command inside the container and streams the logs back to
kubectl, which prints them out in your terminal.
COPYING LOG AND OTHER FILES TO AND FROM A CONTAINER
You can also copy the log file to your local machine using the kubectl cp command,
which we haven’t looked at yet. It allows you to copy files from and into a container. For
example, if a pod called foo-pod and its single container contains a file at /var/log/
foo.log, you can transfer it to your local machine with the following command:
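Based on that example, the command looks like this:

$ kubectl cp foo-pod:/var/log/foo.log foo.log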
To copy a file from your local machine into the pod, specify the pod’s name in the sec-
ond argument:
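For example:

$ kubectl cp localfile foo-pod:/etc/remotefile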
This copies the file localfile to /etc/remotefile inside the pod’s container. If the pod has
more than one container, you specify the container using the -c containerName option.
USING CENTRALIZED LOGGING
In a production system, you’ll want to use a centralized, cluster-wide logging solution,
so all your logs are collected and (permanently) stored in a central location. This
allows you to examine historical logs and analyze trends. Without such a system, a
pod’s logs are only available while the pod exists. As soon as it’s deleted, its logs are
deleted also.
Kubernetes by itself doesn’t provide any kind of centralized logging. The compo-
nents necessary for providing a centralized storage and analysis of all the container
logs must be provided by additional components, which usually run as regular pods in
the cluster.
Deploying centralized logging solutions is easy. All you need to do is deploy a few
YAML/JSON manifests and you’re good to go. On Google Kubernetes Engine, it’s
even easier. Check the Enable Stackdriver Logging checkbox when setting up the clus-
ter. Setting up centralized logging on an on-premises Kubernetes cluster is beyond the
scope of this book, but I’ll give you a quick overview of how it’s usually done.
You may have already heard of the ELK stack composed of ElasticSearch, Logstash,
and Kibana. A slightly modified variation is the EFK stack, where Logstash is replaced
with FluentD.
When using the EFK stack for centralized logging, each Kubernetes cluster node
runs a FluentD agent (usually as a pod deployed through a DaemonSet), which is
responsible for gathering the logs from the containers, tagging them with pod-specific
information, and delivering them to ElasticSearch, which stores them persistently.
ElasticSearch is also deployed as a pod somewhere in the cluster. The logs can then be
viewed and analyzed in a web browser through Kibana, which is a web tool for visualiz-
ing ElasticSearch data. It also usually runs as a pod and is exposed through a Service.
The three components of the EFK stack are shown in the following figure.
(Figure: the EFK stack. Container logs from every node are collected by FluentD and stored in ElasticSearch; users view them in a web browser through Kibana.)
NOTE In the next chapter, you’ll learn about Helm charts. You can use charts
created by the Kubernetes community to deploy the EFK stack instead of cre-
ating your own YAML manifests.
To build your images using Minikube’s Docker daemon, all you need to do is point your DOCKER_HOST environ-
ment variable to it. Luckily, this is much easier than it sounds. All you need to do is
run the following command on your local machine:
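The command isn’t shown in this excerpt, but it’s Minikube’s standard docker-env command:

$ eval $(minikube docker-env)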
This will set all the required environment variables for you. You then build your
images the same way as if the Docker daemon was running on your local machine.
After you build the image, you don’t need to push it anywhere, because it’s already
stored locally on the Minikube VM, which means new pods can use the image immedi-
ately. If your pods are already running, you either need to delete them or kill their
containers so they’re restarted.
BUILDING IMAGES LOCALLY AND COPYING THEM OVER TO THE MINIKUBE VM DIRECTLY
If you can’t use the daemon inside the VM to build the images, you still have a way to
avoid having to push the image to a registry and have the Kubelet running in the
Minikube VM pull it. If you build the image on your local machine, you can copy it
over to the Minikube VM with the following command:
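The exact command isn’t shown here, but one way to do it is to export the image from your local Docker daemon and load it into the daemon inside the Minikube VM, roughly like this (the image name is just an example):

$ docker save myimage | (eval $(minikube docker-env) && docker load)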
As before, the image is immediately ready to be used in a pod. But make sure the
imagePullPolicy in your pod spec isn’t set to Always, because that would cause the
image to be pulled from the external registry again and you’d lose the changes you’ve
copied over.
COMBINING MINIKUBE WITH A PROPER KUBERNETES CLUSTER
You have virtually no limit when developing apps with Minikube. You can even com-
bine a Minikube cluster with a proper Kubernetes cluster. I sometimes run my devel-
opment workloads in my local Minikube cluster and have them talk to my other
workloads that are deployed in a remote multi-node Kubernetes cluster thousands of
miles away.
Once I’m finished with development, I can move my local workloads to the remote
cluster with no modifications and with absolutely no problems thanks to how Kuber-
netes abstracts away the underlying infrastructure from the app.
If you run an agent that periodically (or when it detects a new commit) checks out
your manifests from the Version Control System (VCS), and then runs the apply com-
mand, you can manage your running apps simply by committing changes to the VCS
without having to manually talk to the Kubernetes API server. Luckily, the people at
Box (which coincidentally was used to host this book’s manuscript and other materials)
developed and released a tool called kube-applier, which does exactly what I described.
You’ll find the tool’s source code at https://github.com/box/kube-applier.
You can use multiple branches to deploy the manifests to a development, QA, stag-
ing, and production cluster (or in different namespaces in the same cluster).
The kubia.ksonnet file shown in the listing is converted to a full JSON Deployment
manifest when you run the following command:
$ jsonnet kubia.ksonnet
The power of Ksonnet and Jsonnet becomes apparent when you realize you can define
your own higher-level fragments and make all your manifests consistent and duplica-
tion-free. You’ll find more information on using and installing Ksonnet and Jsonnet at
https://github.com/ksonnet/ksonnet-lib.
17.6 Summary
Hopefully, the information in this chapter has given you an even deeper insight into
how Kubernetes works and will help you build apps that feel right at home when
deployed to a Kubernetes cluster. The aim of this chapter was to
■ Show you how all the resources covered in this book come together to represent a typical application running in Kubernetes.
■ Make you think about the difference between apps that are rarely moved between machines and apps running as pods, which are relocated much more frequently.
■ Help you understand that your multi-component apps (or microservices, if you will) shouldn’t rely on a specific start-up order.
■ Introduce init containers, which can be used to initialize a pod or delay the start of the pod’s main containers until a precondition is met.
■ Teach you about container lifecycle hooks and when to use them.
■ Give you a deeper insight into the consequences of the distributed nature of Kubernetes components and its eventual consistency model.
■ Show you how to make your apps shut down properly without breaking client connections.
■ Give you a few small tips on how to make your apps easier to manage by keeping image sizes small, adding annotations and multi-dimensional labels to all your resources, and making it easier to see why an application terminated.
■ Teach you how to develop Kubernetes apps and run them locally or in Minikube before deploying them on a proper multi-node cluster.
In the next and final chapter, we’ll learn how you can extend Kubernetes with your
own custom API objects and controllers and how others have done it to create com-
plete Platform-as-a-Service solutions on top of Kubernetes.
Extending Kubernetes
You’re almost done. To wrap up, we’ll look at how you can define your own API
objects and create controllers for those objects. We’ll also look at how others have
extended Kubernetes and built Platform-as-a-Service solutions on top of it.
As the Kubernetes ecosystem evolves, you’ll see more and more high-level objects,
which will be much more specialized than the resources Kubernetes supports today.
Instead of dealing with Deployments, Services, ConfigMaps, and the like, you’ll create
and manage objects that represent whole applications or software services. A custom
controller will observe those high-level objects and create low-level objects based on
them. For example, to run a messaging broker inside a Kubernetes cluster, all you’ll
need to do is create an instance of a Queue resource and all the necessary Secrets,
Deployments, and Services will be created by a custom Queue controller. Kubernetes
already provides ways of adding custom resources like this.
NOTE Prior to Kubernetes 1.7, custom resources were defined through Third-
PartyResource objects, which were similar to CustomResourceDefinitions, but
were removed in version 1.8.
Creating a CustomResourceDefinition (CRD) so that users can create objects of the new type isn’t a useful feature if those objects don’t make something tangible happen in the cluster. Each CRD will
usually also have an associated controller (an active component doing something
based on the custom objects), the same way that all the core Kubernetes resources
have an associated controller, as was explained in chapter 11. For this reason, to prop-
erly show what CustomResourceDefinitions allow you to do other than adding
instances of a custom object, a controller must be deployed as well. You’ll do that in
the next example.
INTRODUCING THE EXAMPLE CUSTOMRESOURCEDEFINITION
Let’s imagine you want to allow users of your Kubernetes cluster to run static websites
as easily as possible, without having to deal with Pods, Services, and other Kubernetes
resources. What you want to achieve is for users to create objects of type Website that
contain nothing more than the website’s name and the source from which the web-
site’s files (HTML, CSS, PNG, and others) should be obtained. You’ll use a Git reposi-
tory as the source of those files. When a user creates an instance of the Website
resource, you want Kubernetes to spin up a new web server pod and expose it through
a Service, as shown in figure 18.1.
To create the Website resource, you want users to post manifests along the lines of
the one shown in the following listing.
(Figure 18.1: Each Website object should result in the creation of a Service and an HTTP server Pod; a Website named kubia produces a Service and a Pod both named kubia-website.)
kind: Website                  # a custom object kind
metadata:
  name: kubia                  # the name of the website (used for naming the resulting Service and Pod)
spec:
  gitRepo: https://github.com/luksa/kubia-website-example.git    # the Git repository holding the website's files
Like all other resources, your resource contains a kind and a metadata.name field,
and like most resources, it also contains a spec section. It contains a single field called
gitRepo (you can choose the name)—it specifies the Git repository containing the
website’s files. You’ll also need to include an apiVersion field, but you don’t know yet
what its value must be for custom resources.
If you try posting this resource to Kubernetes, you’ll receive an error because
Kubernetes doesn’t know what a Website object is yet:
$ kubectl create -f imaginary-kubia-website.yaml
error: unable to recognize "imaginary-kubia-website.yaml": no matches for
➥ /, Kind=Website
Before you can create instances of your custom object, you need to make Kubernetes
recognize them.
CREATING A CUSTOMRESOURCEDEFINITION OBJECT
To make Kubernetes accept your custom Website resource instances, you need to post
the CustomResourceDefinition shown in the following listing to the API server.
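The CRD listing isn’t reproduced here, but based on the names and API group used in the rest of this section, it would look roughly like the following (using the apiextensions.k8s.io/v1beta1 API that was current when the book was written):

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: websites.extensions.example.com    # the full name of your custom resource
spec:
  scope: Namespaced                         # Website resources are namespaced
  group: extensions.example.com             # the API group
  version: v1                               # the version of the custom resource
  names:
    kind: Website                           # the kind users specify in their manifests
    singular: website
    plural: websites                        # used in API URLs and kubectl commands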
After you post the descriptor to Kubernetes, it will allow you to create any number of
instances of the custom Website resource.
You can create the CRD from the website-crd.yaml file available in the code archive:
$ kubectl create -f website-crd-definition.yaml
customresourcedefinition "websites.extensions.example.com" created
I’m sure you’re wondering about the long name of the CRD. Why not call it Website?
The reason is to prevent name clashes. By adding a suffix to the name of the CRD
(which will usually include the name of the organization that created the CRD), you
keep CRD names unique. Luckily, the long name doesn’t mean you’ll need to create
your Website resources with kind: websites.extensions.example.com, but as kind:
Website, as specified in the names.kind property of the CRD. The extensions.exam-
ple.com part is the API group of your resource.
You’ve seen how creating Deployment objects requires you to set apiVersion to
apps/v1beta1 instead of v1. The part before the slash is the API group (Deployments
belong to the apps API group), and the part after it is the version name (v1beta1 in
the case of Deployments). When creating instances of the custom Website resource,
the apiVersion property will need to be set to extensions.example.com/v1.
CREATING AN INSTANCE OF A CUSTOM RESOURCE
Considering what you learned, you’ll now create a proper YAML for your Website
resource instance. The YAML manifest is shown in the following listing.
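The manifest isn’t reproduced here, but given the CRD described above, it would look along these lines:

apiVersion: extensions.example.com/v1
kind: Website
metadata:
  name: kubia
spec:
  gitRepo: https://github.com/luksa/kubia-website-example.git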
The kind of your resource is Website, and the apiVersion is composed of the API
group and the version number you defined in the CustomResourceDefinition.
Create your Website object now:
$ kubectl create -f kubia-website.yaml
website "kubia" created
The response tells you that the API server has accepted and stored your custom
Website object. Let’s see if you can now retrieve it.
RETRIEVING INSTANCES OF A CUSTOM RESOURCE
List all the websites in your cluster:
$ kubectl get websites
NAME KIND
kubia Website.v1.extensions.example.com
As with existing Kubernetes resources, you can create and then list instances of cus-
tom resources. You can also use kubectl describe to see the details of your custom
object, or retrieve the whole YAML with kubectl get, as in the following listing.
Listing 18.4 Full Website resource definition retrieved from the API server
Note that the resource includes everything that was in the original YAML definition,
and that Kubernetes has initialized additional metadata fields the way it does with all
other resources.
DELETING AN INSTANCE OF A CUSTOM OBJECT
Obviously, in addition to creating and retrieving custom object instances, you can also
delete them:
$ kubectl delete website kubia
website "kubia" deleted
In general, the point of creating custom objects like this isn’t always to make some-
thing happen when the object is created. Certain custom objects are used to store data
instead of using a more generic mechanism such as a ConfigMap. Applications run-
ning inside pods can query the API server for those objects and read whatever is
stored in them.
But in this case, we said you wanted the existence of a Website object to result in
the spinning up of a web server serving the contents of the Git repository referenced
in the object. We’ll look at how to do that next.
Figure 18.2 The Website controller watches for Website objects and creates a Deployment and a Service.
I’ve written a simple initial version of the controller, which works well enough to
show CRDs and the controller in action, but it’s far from being production-ready,
because it’s overly simplified. The container image is available at docker.io/luksa/
website-controller:latest, and the source code is at https://github.com/luksa/k8s-
website-controller. Instead of going through its source code, I’ll explain what the con-
troller does.
Immediately upon startup, the controller starts watching Website objects by performing an HTTP GET request on the following URL:
http://localhost:8001/apis/extensions.example.com/v1/websites?watch=true
You may recognize the hostname and port—the controller isn’t connecting to the
API server directly, but is instead connecting to the kubectl proxy process, which
runs in a sidecar container in the same pod and acts as the ambassador to the API
server (we examined the ambassador pattern in chapter 8). The proxy forwards the
request to the API server, taking care of both TLS encryption and authentication
(see figure 18.3).
Figure 18.3 The Website controller talks to the API server through a proxy (in the ambassador container).
Through the connection opened by this HTTP GET request, the API server will send
watch events for every change to any Website object.
The API server sends the ADDED watch event every time a new Website object is cre-
ated. When the controller receives such an event, it extracts the Website’s name and
the URL of the Git repository from the Website object it received in the watch event
and creates a Deployment and a Service object by posting their JSON manifests to the
API server.
The Deployment resource contains a template for a pod with two containers
(shown in figure 18.4): one running an nginx server and another one running a git-
sync process, which keeps a local directory synced with the contents of a Git repo.
The local directory is shared with the nginx container through an emptyDir volume
(you did something similar to that in chapter 6, but instead of keeping the local
directory synced with a Git repo, you used a gitRepo volume to download the Git
repo’s contents at pod startup; the volume’s contents weren’t kept in sync with the
Git repo afterward). The Service is a NodePort Service, which exposes your web
server pod through a random port on each node (the same port is used on all
nodes). When a pod is created by the Deployment object, clients can access the web-
site through the node port.
(Figure 18.4: the pod created for each website. A web server container and a git-sync container share an emptyDir volume; the website is served to web clients through a random port.)
The API server also sends a DELETED watch event when a Website resource instance is
deleted. Upon receiving the event, the controller deletes the Deployment and the Ser-
vice resources it created earlier. As soon as a user deletes the Website instance, the
controller will shut down and remove the web server serving that website.
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: website-controller
spec:
  replicas: 1                                  # you'll run a single replica of the controller
  template:
    metadata:
      name: website-controller
      labels:
        app: website-controller
    spec:
      serviceAccountName: website-controller   # it will run under a special ServiceAccount
      containers:                              # two containers: the main container and the proxy sidecar
      - name: main
        image: luksa/website-controller
      - name: proxy
        image: luksa/kubectl-proxy:1.6.2
As you can see, the Deployment deploys a single replica of a two-container pod. One
container runs your controller, whereas the other one is the ambassador container
used for simpler communication with the API server. The pod runs under its own spe-
cial ServiceAccount, so you’ll need to create it before you deploy the controller:
$ kubectl create serviceaccount website-controller
serviceaccount "website-controller" created
If Role Based Access Control (RBAC) is enabled in your cluster, Kubernetes will not
allow the controller to watch Website resources or create Deployments or Services. To
allow it to do that, you’ll need to bind the website-controller ServiceAccount to the
cluster-admin ClusterRole, by creating a ClusterRoleBinding like this:
$ kubectl create clusterrolebinding website-controller
➥ --clusterrole=cluster-admin
➥ --serviceaccount=default:website-controller
clusterrolebinding "website-controller" created
Once you have the ServiceAccount and ClusterRoleBinding in place, you can deploy
the controller’s Deployment.
SEEING THE CONTROLLER IN ACTION
With the controller now running, create the kubia Website resource again:
$ kubectl create -f kubia-website.yaml
website "kubia" created
Now, let’s check the controller’s logs (shown in the following listing) to see if it has
received the watch event.
The logs show that the controller received the ADDED event and that it created a Service
and a Deployment for the kubia-website Website. The API server responded with a
201 Created response, which means the two resources should now exist. Let’s verify
that the Deployment, Service and the resulting Pod were created. The following list-
ing lists all Deployments, Services and Pods.
Listing 18.7 The Deployment, Service, and Pod created for the kubia-website
There they are. The kubia-website Service, through which you can access your web-
site, is available on port 32589 on all cluster nodes. You can access it with your browser.
Awesome, right?
Users of your Kubernetes cluster can now deploy static websites in seconds, with-
out knowing anything about Pods, Services, or any other Kubernetes resources, except
your custom Website resource.
Obviously, you still have room for improvement. The controller could, for exam-
ple, watch for Service objects and as soon as the node port is assigned, write the URL
the website is accessible at into the status section of the Website resource instance
itself. Or it could also create an Ingress object for each website. I’ll leave the imple-
mentation of these additional features to you as an exercise.
to notice the error message by querying the API server for the Website object. Unless
the user does this, they have no way of knowing whether the object is valid or not.
This obviously isn’t ideal. You’d want the API server to validate the object and
reject invalid objects immediately. Validation of custom objects was introduced in
Kubernetes version 1.8 as an alpha feature. To have the API server validate your cus-
tom objects, you need to enable the CustomResourceValidation feature gate in the
API server and specify a JSON schema in the CRD.
(Figure 18.5: API server aggregation. kubectl talks to the main API server, which forwards requests for custom resources to the custom API servers; each custom API server either runs its own etcd instance or uses CustomResourceDefinitions in the main API server as its storage mechanism.)
In your case, you could create an API server responsible for handling your Website
objects. It could validate those objects the way the core Kubernetes API server validates
them. You’d no longer need to create a CRD to represent those objects, because you’d
implement the Website object type into the custom API server directly.
Generally, each API server is responsible for storing their own resources. As shown
in figure 18.5, it can either run its own instance of etcd (or a whole etcd cluster), or it
Extending Kubernetes with the Kubernetes Service Catalog 519
can store its resources in the core API server’s etcd store by creating CRD instances in
the core API server. In that case, it needs to create a CRD object first, before creating
instances of the CRD, the way you did in the example.
REGISTERING A CUSTOM API SERVER
To add a custom API server to your cluster, you’d deploy it as a pod and expose it
through a Service. Then, to integrate it into the main API server, you’d deploy a YAML
manifest describing an APIService resource like the one in the following listing.
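The listing isn’t reproduced here, but an APIService resource registering such a custom API server would look roughly like this (the service name, group, and version match the description below; other details, such as priority fields, vary by Kubernetes version and are omitted from this sketch):

apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1alpha1.extensions.example.com
spec:
  group: extensions.example.com     # the API group this API server handles
  version: v1alpha1                 # the supported version
  service:
    name: website-api               # the Service the custom API server is exposed through
    namespace: default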
After creating the APIService resource from the previous listing, client requests sent
to the main API server that contain any resource from the extensions.example.com
API group and version v1alpha1 would be forwarded to the custom API server pod(s)
exposed through the website-api Service.
CREATING CUSTOM CLIENTS
While you can create custom resources from YAML files using the regular kubectl cli-
ent, to make deployment of custom objects even easier, in addition to providing a cus-
tom API server, you can also build a custom CLI tool. This will allow you to add
dedicated commands for manipulating those objects, similar to how kubectl allows
creating Secrets, Deployments, and other resources through resource-specific com-
mands like kubectl create secret or kubectl create deployment.
As I’ve already mentioned, custom API servers, API server aggregation, and other
features related to extending Kubernetes are currently being worked on intensively, so
they may change after the book is published. To get up-to-date information on the
subject, refer to the Kubernetes GitHub repos at http://github.com/kubernetes.
required to allow users to use a database in their app), someone needs to deploy the
pods providing the service, a Service resource, and possibly a Secret so the client pod
can use it to authenticate with the service. That someone is usually the same user
deploying the client pod or, if a team is dedicated to deploying these types of general
services, the user needs to file a ticket and wait for the team to provision the service.
This means the user needs to either create the manifests for all the components of the
service, know where to find an existing set of manifests, know how to configure it
properly, and deploy it manually, or wait for the other team to do it.
But Kubernetes is supposed to be an easy-to-use, self-service system. Ideally, users
whose apps require a certain service (for example, a web application requiring a back-
end database) should be able to say to Kubernetes: “Hey, I need a PostgreSQL data-
base. Please provision one and tell me where and how I can connect to it.” This will
soon be possible through the Kubernetes Service Catalog.
their pods. Those pods are then injected with a Secret that holds all the necessary cre-
dentials and other data required to connect to the provisioned ServiceInstance.
The Service Catalog system architecture is shown in figure 18.7.
The components shown in the figure are explained in the following sections.
The four Service Catalog–related resources we introduced earlier are created by post-
ing YAML/JSON manifests to the API server. It then stores them into its own etcd
instance or uses CustomResourceDefinitions in the main API server as an alternative
storage mechanism (in that case, no additional etcd instance is required).
The controllers running in the Controller Manager are the ones doing some-
thing with those resources. They obviously talk to the Service Catalog API server, the
way other core Kubernetes controllers talk to the core API server. Those controllers
don’t provision the requested services themselves. They leave that up to external
service brokers, which are registered by creating ServiceBroker resources in the Ser-
vice Catalog API.
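A broker is registered by posting a manifest along these lines (a rough sketch; the broker name matches the one used later in this section, and the URL is purely hypothetical):

apiVersion: servicecatalog.k8s.io/v1alpha1
kind: ClusterServiceBroker
metadata:
  name: database-broker
spec:
  url: http://database-broker.example.com    # where the Service Catalog can reach the broker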
The listing describes an imaginary broker that can provision databases of different
types. After the administrator creates the ClusterServiceBroker resource, a controller
in the Service Catalog Controller Manager connects to the URL specified in the
resource to retrieve the list of services this broker can provision.
After the Service Catalog retrieves the list of services, it creates a ClusterService-
Class resource for each of them. Each ClusterServiceClass resource describes a sin-
gle type of service that can be provisioned (an example of a ClusterServiceClass is
“PostgreSQL database”). Each ClusterServiceClass has one or more service plans asso-
ciated with it. These allow the user to choose the level of service they need (for exam-
ple, a database ClusterServiceClass could provide a “Free” plan, where the size of the
Extending Kubernetes with the Kubernetes Service Catalog 523
database is limited and the underlying storage is a spinning disk, and a “Premium”
plan, with unlimited size and SSD storage).
LISTING THE AVAILABLE SERVICES IN A CLUSTER
Users of the Kubernetes cluster can retrieve a list of all services that can be provi-
sioned in the cluster with kubectl get serviceclasses, as shown in the following
listing.
The listing shows ClusterServiceClasses for services that your imaginary database bro-
ker could provide. You can compare ClusterServiceClasses to StorageClasses, which we
discussed in chapter 6. StorageClasses allow you to select the type of storage you’d like
to use in your pods, while ClusterServiceClasses allow you to select the type of service.
You can see details of one of the ClusterServiceClasses by retrieving its YAML. An
example is shown in the following listing.
The ClusterServiceClass in the listing contains two plans—a free plan, and a premium
plan. You can see that this ClusterServiceClass is provided by the database-broker
broker.
apiVersion: servicecatalog.k8s.io/v1alpha1
kind: ServiceInstance
metadata:
  name: my-postgres-db                         # you're giving this instance a name
spec:
  clusterServiceClassName: postgres-database   # the ServiceClass you want
  clusterServicePlanName: free                 # and the plan you want
  parameters:
    init-db-args: --data-checksums             # additional parameters passed to the broker
You created a ServiceInstance called my-postgres-db (that will be the name of the
resource you’re deploying) and specified the ClusterServiceClass and the chosen
plan. You’re also specifying a parameter, which is specific for each broker and Cluster-
ServiceClass. Let’s imagine you looked up the possible parameters in the broker’s doc-
umentation.
As soon as you create this resource, the Service Catalog will contact the broker the
ClusterServiceClass belongs to and ask it to provision the service. It will pass on the
chosen ClusterServiceClass and plan names, as well as all the parameters you specified.
It’s then completely up to the broker to know what to do with this information. In
your case, your database broker will probably spin up a new instance of a PostgreSQL
database somewhere—not necessarily in the same Kubernetes cluster or even in
Kubernetes at all. It could run a Virtual Machine and run the database in there. The
Service Catalog doesn’t care, and neither does the user requesting the service.
You can check if the service has been provisioned successfully by inspecting the
status section of the my-postgres-db ServiceInstance you created, as shown in the
following listing.
  - lastTransitionTime: 2017-05-17T13:57:22Z
    message: The instance was provisioned successfully   # the database was provisioned successfully
    reason: ProvisionedSuccessfully
    status: "True"
    type: Ready                                           # it's ready to be used
A database instance is now running somewhere, but how do you use it in your pods?
To do that, you need to bind it.
BINDING A SERVICEINSTANCE
To use a provisioned ServiceInstance in your pods, you create a ServiceBinding
resource, as shown in the following listing.
apiVersion: servicecatalog.k8s.io/v1alpha1
kind: ServiceBinding
metadata:
  name: my-postgres-db-binding
spec:
  instanceRef:
    name: my-postgres-db             # you're referencing the instance you created earlier
  secretName: postgres-secret        # you'd like the credentials for accessing the service stored in this Secret
The listing shows that you’re defining a ServiceBinding resource called my-postgres-
db-binding, in which you’re referencing the my-postgres-db service instance you
created earlier. You’re also specifying a name of a Secret. You want the Service Catalog
to put all the necessary credentials for accessing the service instance into a Secret
called postgres-secret. But where are you binding the ServiceInstance to your pods?
Nowhere, actually.
Currently, the Service Catalog doesn’t yet make it possible to inject pods with the
ServiceInstance’s credentials. This will be possible when a new Kubernetes feature
called PodPresets is available. Until then, you can choose a name for the Secret
where you want the credentials to be stored in and mount that Secret into your pods
manually.
When you submit the ServiceBinding resource from the previous listing to the Ser-
vice Catalog API server, the controller will contact the Database broker once again
and create a binding for the ServiceInstance you provisioned earlier. The broker
responds with a list of credentials and other data necessary for connecting to the data-
base. The Service Catalog creates a new Secret with the name you specified in the
ServiceBinding resource and stores all that data in the Secret.
USING THE NEWLY CREATED SECRET IN CLIENT PODS
The Secret created by the Service Catalog system can be mounted into pods, so they
can read its contents and use them to connect to the provisioned service instance (a
PostgreSQL database in the example). The Secret could look like the one in the fol-
lowing listing.
Listing 18.15 A Secret holding the credentials for connecting to the service instance
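The listing isn’t reproduced here, but such a Secret would typically contain connection details along these lines (all keys and values are purely illustrative; the actual contents are defined by the broker):

apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
type: Opaque
stringData:                 # illustrative keys only
  host: <database host>
  username: <username>
  password: <password>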
Because you can choose the name of the Secret yourself, you can deploy pods before
provisioning or binding the service. As you learned in chapter 7, the pods won’t be
started until such a Secret exists.
If necessary, multiple bindings can be created for different pods. The service bro-
ker can choose to use the same set of credentials in every binding, but it’s better to
create a new set of credentials for every binding instance. This way, pods can be pre-
vented from using the service by deleting the ServiceBinding resource.
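For example, deleting the binding created earlier would look like this:

$ kubectl delete servicebinding my-postgres-db-binding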
When you do this, the Service Catalog controller will delete the Secret and call the bro-
ker to perform an unbinding operation. The service instance (in your case a PostgreSQL
database) is still running. You can therefore create a new ServiceBinding if you want.
But if you don’t need the database instance anymore, you should delete the Service-
Instance resource also:
$ kubectl delete serviceinstance my-postgres-db
serviceinstance "my-postgres-db" deleted
Deleting the ServiceInstance resource causes the Service Catalog to perform a depro-
visioning operation on the service broker. Again, exactly what that means is up to the
service broker, but in your case, the broker should shut down the PostgreSQL data-
base instance that it created when we provisioned the service instance.
For example, I’ve been involved with the Service Catalog since early on and have
implemented a broker, which makes it trivial to provision messaging systems and
expose them to pods in a Kubernetes cluster. Another team has implemented a broker
that makes it easy to provision services on Amazon Web Services.
In general, service brokers allow easy provisioning and exposing of services in
Kubernetes and will make Kubernetes an even more awesome platform for deploying
your applications.
■ Templates
■ BuildConfigs
■ DeploymentConfigs
■ ImageStreams
■ Routes
■ And others
The template itself is a JSON or YAML file containing a list of parameters that are ref-
erenced in resources defined in that same JSON/YAML. The template can be stored
in the API server like any other object. Before a template can be instantiated, it needs
to be processed. To process a template, you supply the values for the template’s
parameters and then OpenShift replaces the references to the parameters with those
values. The result is a processed template, which is exactly like a Kubernetes resource
list that can then be created with a single POST request.
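To make this concrete, here is a minimal sketch of what such a template could look like (the template name, the parameter, and the single Service object are illustrative; real templates usually define many objects and parameters):

apiVersion: v1                    # newer OpenShift versions also accept template.openshift.io/v1
kind: Template
metadata:
  name: my-app-template
parameters:
- name: APP_NAME                  # referenced below as ${APP_NAME}
  description: Name used for the created objects
  value: my-app                   # the default value
objects:
- apiVersion: v1
  kind: Service
  metadata:
    name: ${APP_NAME}             # replaced with the parameter's value during processing
  spec:
    selector:
      app: ${APP_NAME}
    ports:
    - port: 80

With the oc CLI, a stored template like this would typically be processed and instantiated with something like oc process my-app-template -p APP_NAME=frontend | oc create -f -.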
OpenShift provides a long list of pre-fabricated templates that allow users to
quickly run complex applications by specifying a few arguments (or none at all, if the
template provides good defaults for those arguments). For example, a template can
enable the creation of all the Kubernetes resources necessary to run a Java EE appli-
cation inside an Application Server, which connects to a back-end database, also
deployed as part of that same template. All those components can be deployed with a
single command.
BUILDING IMAGES FROM SOURCE USING BUILDCONFIGS
One of the best features of OpenShift is the ability to have OpenShift build and imme-
diately deploy an application in the OpenShift cluster by pointing it to a Git repository
holding the application’s source code. You don’t need to build the container image at
all—OpenShift does that for you. This is done by creating a resource called Build-
Config, which can be configured to trigger builds of container images immediately
after a change is committed to the source Git repository.
Although OpenShift doesn’t monitor the Git repository itself, a hook in the repos-
itory can notify OpenShift of the new commit. OpenShift will then pull the changes
from the Git repository and start the build process. A build mechanism called Source
To Image can detect what type of application is in the Git repository and run the
proper build procedure for it. For example, if it detects a pom.xml file, which is used
in Java Maven-formatted projects, it runs a Maven build. The resulting artifacts are
packaged into an appropriate container image, and are then pushed to an internal
container registry (provided by OpenShift). From there, they can be pulled and run
in the cluster immediately.
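As a rough sketch, a BuildConfig using the Source To Image mechanism might look like the following (the names, the Git URL, and the nodejs builder image are illustrative):

apiVersion: build.openshift.io/v1       # plain v1 in older OpenShift releases
kind: BuildConfig
metadata:
  name: kubia
spec:
  source:
    git:
      uri: https://github.com/example/kubia.git   # the repo holding the app's source code
  strategy:
    sourceStrategy:                     # the Source To Image build strategy
      from:
        kind: ImageStreamTag
        name: nodejs:latest             # the builder image matching the project type
  output:
    to:
      kind: ImageStreamTag
      name: kubia:latest                # the built image is pushed to this ImageStream
  triggers:
  - type: GitHub                        # the repo's webhook notifies OpenShift of new commits
    github:
      secret: replace-with-a-webhook-secret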
By creating a BuildConfig object, developers can thus point to a Git repo and not
worry about building container images. Developers have almost no need to know
anything about containers. Once the ops team deploys an OpenShift cluster and
gives developers access to it, those developers can develop their code, commit, and
push it to a Git repo, the same way they used to before we started packaging apps into
containers. Then OpenShift takes care of building, deploying, and managing apps
from that code.
AUTOMATICALLY DEPLOYING NEWLY BUILT IMAGES WITH DEPLOYMENTCONFIGS
Once a new container image is built, it can also automatically be deployed in the clus-
ter. This is enabled by creating a DeploymentConfig object and pointing it to an
ImageStream. As the name suggests, an ImageStream is a stream of images. When an
image is built, it’s added to the ImageStream. This enables the DeploymentConfig to
notice the newly built image and allows it to take action and initiate a rollout of the
new image (see figure 18.9).
Figure 18.9 A build trigger causes the BuildConfig to run a builder pod, which clones the Git repo, builds a new image from the source, and adds it to the ImageStream; the DeploymentConfig watches for new images in the ImageStream and rolls out the new version (similarly to a Deployment), with a ReplicationController creating the new pods.
they’ve also developed a tool called Helm, which is gaining traction in the Kubernetes
community as a standard way of deploying existing apps in Kubernetes. We’ll take a
brief look at both.
INTRODUCING DEIS WORKFLOW
You can deploy Deis Workflow to any existing Kubernetes cluster (unlike OpenShift,
which is a complete cluster with a modified API server and other Kubernetes compo-
nents). When you run Workflow, it creates a set of Services and ReplicationControllers,
which then provide developers with a simple, developer-friendly environment.
Deploying new versions of your app is triggered by pushing your changes with git
push deis master and letting Workflow take care of the rest. Similar to OpenShift,
Workflow also provides a source to image mechanism, application rollouts and roll-
backs, edge routing, and also log aggregation, metrics, and alerting, which aren’t
available in core Kubernetes.
To run Workflow in your Kubernetes cluster, you first need to install the Deis Work-
flow and Helm CLI tools and then install Workflow into your cluster. We won’t go into
how to do that here, but if you'd like to learn more, visit the website at https://deis.com/workflow. What we'll explore here is the Helm tool, which can be used without
Workflow and has gained popularity in the community.
DEPLOYING RESOURCES THROUGH HELM
Helm is a package manager for Kubernetes (similar to OS package managers such as yum or apt on Linux or Homebrew on macOS).
Helm consists of two components:
■ A helm CLI tool (the client).
■ Tiller, a server component running as a Pod inside the Kubernetes cluster.
Those two components are used to deploy and manage application packages in a
Kubernetes cluster. Helm application packages are called Charts. They’re combined
with a Config, which contains configuration information and is merged into a Chart
to create a Release, which is a running instance of an application (a combined Chart
and Config). You deploy and manage Releases using the helm CLI tool, which talks to
the Tiller server, which is the component that creates all the necessary Kubernetes
resources defined in the Chart, as shown in figure 18.10.
You can create charts yourself and keep them on your local disk, or you can use
any existing chart, which is available in the growing list of helm charts maintained by
the community at https://github.com/kubernetes/charts. The list includes charts for
applications such as PostgreSQL, MySQL, MariaDB, Magento, Memcached, MongoDB,
OpenVPN, PHPBB, RabbitMQ, Redis, WordPress, and others.
Similar to how you don’t build and install apps developed by other people to your
Linux system manually, you probably don’t want to build and manage your own
Kubernetes manifests for such applications, right? That’s why you’ll want to use Helm
and the charts available in the GitHub repository I mentioned.
Figure 18.10 The helm CLI tool combines a Chart (files on local disk) with a Config and sends them to Tiller (a pod running in the Kubernetes cluster), which creates the Deployments, Services, and other objects defined in the Chart.
When you want to run a PostgreSQL or a MySQL database in your Kubernetes cluster,
don’t start writing manifests for them. Instead, check if someone else has already gone
through the trouble and prepared a Helm chart for it.
Once someone prepares a Helm chart for a specific application and adds it to the
Helm chart GitHub repo, installing the whole application takes a single one-line com-
mand. For example, to run MySQL in your Kubernetes cluster, all you need to do is
clone the charts Git repo to your local machine and run the following command (pro-
vided you have Helm’s CLI tool and Tiller running in your cluster):
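With Helm 2, the command would look roughly like this (the release name my-database is illustrative, and --name can be omitted to let Helm generate one):

$ helm install --name my-database stable/mysql    # run from the root of the cloned charts repo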
This will create all the necessary Deployments, Services, Secrets, and PersistentVolu-
meClaims needed to run MySQL in your cluster. You don’t need to concern yourself
with what components you need and how to configure them to run MySQL properly.
I’m sure you’ll agree this is awesome.
TIP One of the most interesting charts available in the repo is an OpenVPN
chart, which runs an OpenVPN server inside your Kubernetes cluster and
allows you to enter the pod network through VPN and access Services as if
your local machine were a pod in the cluster. This is useful when you're devel-
oping apps and running them locally.
These were several examples of how Kubernetes can be extended and how companies
like Red Hat and Deis (now Microsoft) have extended it. Now go and start riding the
Kubernetes wave yourself!
18.4 Summary
This final chapter has shown you how you can go beyond the existing functionality Kubernetes provides and how companies like Deis and Red Hat have done it. You've learned how
■ Custom resources can be registered in the API server by creating a CustomResourceDefinition object.
■ Instances of custom objects can be stored, retrieved, updated, and deleted without having to change the API server code.
■ A custom controller can be implemented to bring those objects to life.
■ Kubernetes can be extended with custom API servers through API aggregation.
■ The Kubernetes Service Catalog makes it possible to self-provision external services and expose them to pods running in the Kubernetes cluster.
■ Platforms-as-a-Service built on top of Kubernetes make it easy to build containerized applications inside the same Kubernetes cluster that then runs them.
■ A package manager called Helm makes it possible to deploy existing apps without requiring you to build resource manifests for them.
Thank you for taking the time to read through this long book. I hope you’ve learned
as much from reading it as I have from writing it.
appendix A
Using kubectl
with multiple clusters
A.1 Switching between Minikube and Google Kubernetes
Engine
The examples in this book can either be run in a cluster created with Minikube, or
one created with Google Kubernetes Engine (GKE). If you plan on using both, you
need to know how to switch between them. How to use kubectl with multiple clusters is explained in detail in the next section. Here we look at how
to switch between Minikube and GKE.
SWITCHING TO MINIKUBE
Luckily, every time you start up a Minikube cluster with minikube start, it also
reconfigures kubectl to use it:
$ minikube start
Starting local Kubernetes cluster...
...
Setting up kubeconfig...
Kubectl is now configured to use the cluster.
After switching from Minikube to GKE, you can switch back by stopping Minikube
and starting it up again. kubectl will then be re-configured to use the Minikube clus-
ter again.
SWITCHING TO GKE
To switch to using the GKE cluster, you can use the following command:
$ gcloud container clusters get-credentials my-gke-cluster
This will configure kubectl to use the GKE cluster called my-gke-cluster.
GOING FURTHER
These two methods should be enough to get you started quickly, but to understand
the complete picture of using kubectl with multiple clusters, study the next section.
NOTE You can use multiple config files and have kubectl use them all at
once by specifying all of them in the KUBECONFIG environment variable (sepa-
rate them with a colon).
apiVersion: v1
clusters:                       # Contains information about a Kubernetes cluster
- cluster:
    certificate-authority: /home/luksa/.minikube/ca.crt
    server: https://192.168.99.100:8443
  name: minikube
contexts:                       # Defines a kubectl context
- context:
    cluster: minikube
    user: minikube
    namespace: default
  name: minikube
current-context: minikube       # The current context kubectl uses
kind: Config
preferences: {}
users:                          # Contains a user's credentials
- name: minikube
  user:
    client-certificate: /home/luksa/.minikube/apiserver.crt
    client-key: /home/luksa/.minikube/apiserver.key
Each cluster, user, and context has a name. The name is used to refer to the context,
user, or cluster.
CLUSTERS
A cluster entry represents a Kubernetes cluster and contains the URL of the API
server, the certificate authority (CA) file, and possibly a few other configuration
options related to communication with the API server. The CA certificate can be
stored in a separate file and referenced in the kubeconfig file, or it can be included in
it directly in the certificate-authority-data field.
USERS
Each user defines the credentials to use when talking to an API server. This can be a
username and password pair, an authentication token, or a client key and certificate.
The certificate and key can be included in the kubeconfig file (through the client-
certificate-data and client-key-data properties) or stored in separate files and
referenced in the config file, as shown in listing A.1.
CONTEXTS
A context ties together a cluster, a user, and the default namespace kubectl should use
when performing commands. Multiple contexts can point to the same user or cluster.
THE CURRENT CONTEXT
While there can be multiple contexts defined in the kubeconfig file, at any given time
only one of them is the current context. Later we’ll see how the current context can
be changed.
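Adding or modifying a cluster entry is done with the kubectl config set-cluster command. For the cluster described next, the invocation would be roughly the following (the --certificate-authority flag is optional and the path shown is illustrative):

$ kubectl config set-cluster my-other-cluster --server=https://k8s.example.com:6443
➥   --certificate-authority=path/to/the/cafile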
This will add a cluster called my-other-cluster with the API server located at https://k8s.example.com:6443. To see additional options you can pass to the command, run
kubectl config set-cluster to have it print out usage examples.
If a cluster by that name already exists, the set-cluster command will overwrite
its configuration options.
ADDING OR MODIFYING USER CREDENTIALS
Adding and modifying users is similar to adding or modifying a cluster. To add a user
that authenticates with the API server using a username and password, run the follow-
ing command:
$ kubectl config set-credentials foo --username=foo --password=pass
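To add a user that authenticates with a client certificate and key instead, you'd run something like the following (the file paths are illustrative):

$ kubectl config set-credentials foo --client-certificate=path/to/user.crt
➥   --client-key=path/to/user.key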
Both these examples store user credentials under the name foo. If you use the same
credentials for authenticating against different clusters, you can define a single user
and use it with both clusters.
TYING CLUSTERS AND USER CREDENTIALS TOGETHER
A context defines which user to use with which cluster, but can also define the name-
space that kubectl should use, when you don’t specify the namespace explicitly with
the --namespace or -n option.
The following command is used to create a new context that ties together the clus-
ter and the user you created:
$ kubectl config set-context some-context --cluster=my-other-cluster
➥ --user=foo --namespace=bar
This creates a context called some-context that uses the my-other-cluster cluster
and the foo user credentials. The default namespace in this context is set to bar.
You can also use the same command to change the namespace of your current
context, for example. You can get the name of the current context like so:
$ kubectl config current-context
minikube
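To change the default namespace of the current context, you pass the current context's name back to the set-context command, roughly like this (the namespace name is illustrative):

$ kubectl config set-context $(kubectl config current-context) --namespace=some-namespace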
Running this simple command once is much more user-friendly compared to having
to include the --namespace option every time you run kubectl.
TIP To easily switch between namespaces, define an alias like this: alias
kcd='kubectl config set-context $(kubectl config current-context)
--namespace '. You can then switch between namespaces with kcd some-
namespace.
■ --server to specify the URL of a different server (which isn’t in the config file).
■ --namespace to use a different namespace.
As you can see, I’m using three different contexts. The rpi-cluster and the rpi-foo
contexts use the same cluster and credentials, but default to different namespaces.
Listing clusters is similar:
$ kubectl config get-clusters
NAME
rpi-cluster
minikube
TIP When you click into the VM’s window, your keyboard and mouse will be
captured by the VM. To release them, press the key shown at the bottom-right
corner of the VirtualBox window the VM is running in. This is usually the right Control key on Windows and Linux or the left Command key on macOS.
First, click Installation Destination and then immediately click the Done button on
the screen that appears (you don’t need to click anywhere else).
Then click on Network & Host Name. On the next screen, first enable the network
adapter by clicking the ON/OFF switch in the top right corner. Then enter the host
name into the field at the bottom left, as shown in figure B.5. You’re currently setting
up the master, so set the host name to master.k8s. Click the Apply button next to the
text field to confirm the new host name.
Figure B.5 Setting the hostname and configuring the network adapter
To return to the main setup screen, click the Done button in the top-left corner.
You also need to set the correct time zone. Click Date & Time and then, on the
screen that opens, select the Region and City or click your location on the map. Return
to the main screen by clicking the Done button in the top-left corner.
RUNNING THE INSTALL
To start the installation, click the Begin Installation button in the bottom-right corner.
A screen like the one in figure B.6 will appear. While the OS is being installed, set the root password and create a user account, if you want. When the installation completes, click the Reboot button at the bottom right.
Figure B.6 Setting the root password while the OS is being installed and rebooting afterward
# setenforce 0
But this only disables it temporarily (until the next reboot). To disable it perma-
nently, edit the /etc/selinux/config file and change the SELINUX=enforcing line to
SELINUX=permissive.
NOTE Make sure no whitespace exists after EOF if you’re copying and pasting.
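The packages listed below are installed with a single yum command along these lines (exact versions aside, this is a reconstruction of the install step):

# yum install -y docker kubelet kubeadm kubectl kubernetes-cni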
As you can see, you’re installing quite a few packages. Here’s what they are:
■ docker —The container runtime
■ kubelet —The Kubernetes node agent, which will run everything for you
■ kubeadm —A tool for deploying multi-node Kubernetes clusters
■ kubectl —The command line tool for interacting with Kubernetes
■ kubernetes-cni—The Kubernetes Container Networking Interface
Once they’re all installed, you need to manually enable the docker and the kubelet
services:
# systemctl enable docker && systemctl start docker
# systemctl enable kubelet && systemctl start kubelet
DISABLING SWAP
The Kubelet won’t run if swap is enabled, so you’ll disable it with the following
command:
# swapoff -a && sed -i '/ swap / s/^/#/' /etc/fstab
# shutdown now
CLONING THE VM
Now, right-click on the VM in the VirtualBox UI and select Clone. Enter the name for
the new machine as shown in figure B.7 (for example, k8s-node1 for the first clone or
k8s-node2 for the second one). Make sure you check the Reinitialize the MAC address
of all network cards option, so each VM uses different MAC addresses (because
they’re going to be located in the same network).
Click the Next button and then make sure the Full clone option is selected before
clicking Next again. Then, on the next screen, click Clone (leave the Current machine
state option selected).
Repeat the process for the VM for the second node and then start all three VMs by
selecting all three and clicking the Start icon.
CHANGING THE HOSTNAME ON THE CLONED VMS
Because you created two clones from your master VM, all three VMs have the same host-
name configured. Therefore, you need to change the hostnames of the two clones. To
do that, log into each of the two nodes (as root) and run the following command:
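On CentOS this is typically done with hostnamectl, for example (use node2.k8s on the second node):

# hostnamectl --static set-hostname node1.k8s

You'll also want each machine to be able to resolve the other machines' hostnames; the /etc/hosts entries shown next (with each node's actual IP address) take care of that: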
192.168.64.138 master.k8s
192.168.64.139 node1.k8s
192.168.64.140 node2.k8s
You can get each node’s IP by logging into the node as root, running ip addr and
finding the IP address associated with the enp0s3 network adapter, as shown in the fol-
lowing listing.
# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state
UP qlen 1000
link/ether 08:00:27:db:c3:a4 brd ff:ff:ff:ff:ff:ff
inet 192.168.64.138/24 brd 192.168.64.255 scope global dynamic enp0s3
valid_lft 59414sec preferred_lft 59414sec
inet6 fe80::77a9:5ad6:2597:2e1b/64 scope link
valid_lft forever preferred_lft forever
The command’s output in the previous listing shows that the machine’s IP address is
192.168.64.138. You’ll need to run this command on each of your nodes to get all
their IPs.
# kubeadm init
[kubeadm] WARNING: kubeadm is in beta, please do not use it for production
clusters.
[init] Using Kubernetes version: v1.8.4
...
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
http://kubernetes.io/docs/admin/addons/
You can now join any number of machines by running the following on each node
as root:
NOTE Write down the command shown in the last line of kubeadm init’s out-
put. You’ll need it later.
Kubeadm has deployed all the necessary Control Plane components, including etcd,
the API server, Scheduler, and Controller Manager. It has also deployed the kube-
proxy, making Kubernetes services available from the master node.
# export KUBECONFIG=/etc/kubernetes/admin.conf
LISTING NODES
You’re finished with setting up the master, but you still need to set up the nodes.
Although you already installed the Kubelet on both of your two worker nodes (you
either installed each node separately or cloned the initial VM after you installed all
the required packages), they aren’t part of your Kubernetes cluster yet. You can see
that by listing nodes with kubectl:
# kubectl get node
NAME STATUS ROLES AGE VERSION
master.k8s NotReady master 2m v1.8.4
See, only the master is listed as a node. And even the master is shown as being NotReady. You'll see why later. Now, you'll set up your two nodes.
Listing B.6 Last part of the output of the kubeadm init command
You can now join any number of machines by running the following on each node
as root:
All you need to do is run the kubeadm join command with the specified token and the
master’s IP address/port on both of your nodes. It then takes less than a minute for
the nodes to register themselves with the master. You can confirm they’re registered
by running the kubectl get node command on the master again:
Okay, you’ve made progress. Your Kubernetes cluster now consists of three nodes, but
none of them are ready. Let’s investigate.
Let’s use the kubectl describe command in the following listing to see more
information. Somewhere at the top, you’ll see a list of Conditions, showing the
current conditions on the node. One of them will show the following Reason and
Message.
Listing B.7 Kubectl describe shows why the node isn’t ready
According to this, the Kubelet isn’t fully ready, because the container network (CNI)
plugin isn’t ready, which is expected, because you haven’t deployed the CNI plugin
yet. You’ll deploy one now.
This will deploy a DaemonSet and a few security-related resources (refer to chapter 12
for an explanation of the ClusterRole and ClusterRoleBinding, which are deployed
alongside the DaemonSet).
Once the DaemonSet controller creates the pods and they’re started on all your
nodes, the nodes should become ready:
# k get node
NAME STATUS ROLES AGE VERSION
master.k8s Ready master 9m v1.8.4
node1.k8s Ready <none> 5m v1.8.4
node2.k8s Ready <none> 5m v1.8.4
And that’s it. You now have a fully functioning three-node Kubernetes cluster with an
overlay network provided by Weave Net. All the required components, except for the
Kubelet itself, are running as pods, managed by the Kubelet, as shown in the follow-
ing listing.
Listing B.8 System pods in the kube-system namespace after deploying Weave Net
Replace the IP with that of your master. Then you point the KUBECONFIG environment
variable to the ~/.kube/config2 file like this:
$ export KUBECONFIG=~/.kube/config2
Kubectl will now use this config file. To switch back to using the previous one, unset
the environment variable.
You’re now all set to use the cluster from your local machine.
appendix C
Using other container
runtimes
C.1 Replacing Docker with rkt
We’ve mentioned rkt (pronounced rock-it) a few times in this book. Like Docker, it
runs applications in isolated containers, using the same Linux technologies as
those used by Docker. Let’s look at how rkt differs from Docker and how to try it in
Minikube.
The first great thing about rkt is that it directly supports the notion of a Pod
(running multiple related containers), unlike Docker, which only runs individual
containers. Rkt is based on open standards and was built with security in mind from
the start (for example, images are signed, so you can be sure they haven’t been tam-
pered with). Unlike Docker, which initially had a client-server based architecture
that didn’t play well with init systems such as systemd, rkt is a CLI tool that runs
your container directly, instead of telling a daemon to run it. A nice thing about rkt
is that it can run existing Docker-formatted container images, so you don’t need to
repackage your applications to get started with rkt.
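Minikube is told to use rkt instead of Docker through a startup flag; starting it would look roughly like this:

$ minikube start --container-runtime=rkt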
NOTE You may need to run minikube delete to delete the existing Minikube
VM first.
When the pod starts up, you can see it’s running through rkt by inspecting its contain-
ers with kubectl describe, as shown in the following listing.
You can also try hitting the pod’s HTTP port to see if it’s responding properly to
HTTP requests. You can do this by creating a NodePort Service or by using kubectl
port-forward, for example.
INSPECTING THE RUNNING CONTAINERS IN THE MINIKUBE VM
To get more familiar with rkt, you can try logging into the Minikube VM with the fol-
lowing command:
$ minikube ssh
Then, you can use rkt list to see the running pods and containers, as shown in the
following listing.
$ rkt list
UUID APP IMAGE NAME STATE ...
4900e0a5 k8s-dashboard gcr.io/google_containers/kun... running ...
564a6234 nginx-ingr-ctrlr gcr.io/google_containers/ngi... running ...
5dcafffd dflt-http-backend gcr.io/google_containers/def... running ...
707a306c kube-addon-manager gcr.io/google-containers/kub... running ...
87a138ce kubia registry-1.docker.io/luksa/k... running ...
d97f5c29 kubedns gcr.io/google_containers/k8s... running ...
dnsmasq gcr.io/google_containers/k8...
sidecar gcr.io/google_containers/k8...
You can see the kubia container, as well as other system containers running (the ones
deployed in pods in the kube-system namespace). Notice how the bottom two con-
tainers don’t have anything listed in the UUID or STATE columns? That’s because they
belong to the same pod as the kubedns container listed above them.
Rkt prints containers belonging to the same pod grouped together. Each pod
(instead of each container) has its own UUID and state. If you tried doing this when
you were using Docker as the Container Runtime, you’ll appreciate how much easier
it is to see all the pods and their containers with rkt. You’ll notice no infrastructure
container exists for each pod (we explained them in chapter 11). That’s because of
rkt’s native support for pods.
LISTING CONTAINER IMAGES
If you’ve played around with Docker CLI commands, you’ll get familiar quickly with
rkt’s commands. Run rkt without any arguments and you’ll see all the commands you
can run. For example, to list container images, you run the command in the follow-
ing listing.
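The command is rkt image list; its output (not reproduced here) shows every image rkt has fetched:

$ rkt image list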
These are all Docker-formatted container images. You can also try building images in the OCI image format (OCI stands for Open Container Initiative) with the acbuild tool.
capacity, letting the application spill over to a cloud-based cluster, which is automati-
cally provisioned on the cloud provider’s infrastructure.
The federation Control Plane, consisting of its own etcd, the federation API server, and the federation Controller Manager, sits above the underlying clusters (each with its own API server, Controller Manager, and worker nodes) in San Francisco, London, and other locations.
Figure D.2 The relationship between federated resources and regular resources in underlying clusters: a federated Deployment X and Secret Y exist in each underlying cluster, with three replicas of the Deployment in one cluster and two in the other
PodDisruptionBudget (pdb) [policy/v1beta1] Defines the minimum number of pods that must 15.3.3
remain running when evacuating nodes
Resources
LimitRange (limits) [v1] Defines the min, max, default limits, and default 14.4
requests for pods in a namespace
Event (ev) [v1] A report of something that occurred in the cluster 11.2.3
Kubernetes IN ACTION
Marko Lukša

Kubernetes is Greek for "helmsman," your guide through unknown waters. The Kubernetes container orchestration system safely manages the structure and flow of a distributed application, organizing containers and services for maximum efficiency. Kubernetes serves as an operating system for your clusters, eliminating the need to factor the underlying network and server infrastructure into your designs.

Kubernetes in Action teaches you to use Kubernetes to deploy container-based distributed applications. You'll start with an overview of Docker and Kubernetes before building your first Kubernetes cluster. You'll gradually expand your initial application, adding features and deepening your knowledge of Kubernetes architecture and operation. As you navigate this comprehensive guide, you'll explore high-value topics like monitoring, tuning, and scaling.

What's Inside
● Kubernetes' internals
● Securing clusters
● Updating applications with zero downtime

Written for intermediate software developers with little or no familiarity with Docker or container orchestration systems.

Marko Lukša is an engineer at Red Hat working on Kubernetes and OpenShift.

"Authoritative and exhaustive. In a hands-on style, the author teaches how to manage the complete lifecycle of any distributed and scalable application."
—Antonio Magnaghi, System1

"The best parts are the real-world examples. They don't just apply the concepts, they road test them."
—Paolo Antinori, Red Hat

"An in-depth discussion of Kubernetes and related technologies. A must-have!"
—Al Krinker, USPTO

"The full path to becoming a professional Kubernaut. Fundamental reading."
—Csaba Sári, Chimera Entertainment