Piconet

Kubernetes Core Components

There have been a few articles popping up lately that advocate Nomad. This isn’t an argument against using Nomad over Kubernetes. I just want to jot down some thoughts about what makes up a Kubernetes cluster. This'll be written with the presumption that the reader has at least dabbled with Kubernetes. I don’t have it in me to go over what a ‘pod’ is.

I won't be talking about CNI, or the abstractions that exist over pods. We're sticking to just the important components that make up a cluster.

The Architecture

Kubernetes has a broad architecture. There are a lot of individual processes involved, and you need to care about the health of each and every one.

To be clear - I’m not advocating understanding how every single component works here. But I am advocating that if you want to use Kubernetes, you at least know what they do, because when they go tits-up, you want to know at least two things:

What’s actually broken in your cluster
What the log output means

etcd

etcd is the datastore. This is where every bit of information is kept. You may have seen it elsewhere. It’s a well-tested, solid key-value store built around Raft consensus.

If it's unclear what I mean by 'every bit of information', I do mean everything. Every single object you'll interact with, from pods to roles to IP addresses to network policies - stored inside etcd.
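To make "everything lives in etcd" a bit more concrete, here's a minimal sketch of a key-value store with prefix listing. It's an in-memory toy, not etcd's API: `ToyStore` and its methods are made up for illustration. The keys are modelled on the `/registry/<resource>/<namespace>/<name>` layout Kubernetes uses by default, and the values are heavily simplified (real objects are stored in a serialized form).

```python
class ToyStore:
    """In-memory stand-in for a key-value store. Hypothetical, not etcd's real API."""

    def __init__(self):
        self._data = {}
        self.revision = 0  # bumped on every write, loosely like a store-wide revision

    def put(self, key, value):
        self.revision += 1
        self._data[key] = value
        return self.revision

    def get(self, key):
        return self._data.get(key)

    def list_prefix(self, prefix):
        """Return (key, value) pairs whose key starts with prefix, sorted by key."""
        matches = [(k, v) for k, v in self._data.items() if k.startswith(prefix)]
        return sorted(matches, key=lambda pair: pair[0])


store = ToyStore()
store.put("/registry/pods/mynamespace/api-a", {"podIP": "10.99.19.19"})
store.put("/registry/pods/mynamespace/api-b", {"podIP": "10.99.17.3"})
store.put("/registry/roles/mynamespace/reader", {"verbs": ["get", "list"]})

# "List all pods in mynamespace" becomes a prefix scan over keys.
pods = store.list_prefix("/registry/pods/mynamespace/")
```

The point is only that pods, roles, and everything else end up as values under predictable keys, and that listing a kind of object is a prefix scan.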

coredns

coredns is the DNS server. It's necessary if you want containers to be able to reference other containers, since both the names and addresses of pods are ephemeral. Although not as commonly seen outside of Kubernetes as etcd, it’s also an off-the-shelf solution.

kube-apiserver

The API server is your interface to the cluster, and also the cluster’s interface to the cluster. A nice way to consider it is as a REST API to etcd that also applies access control and object validation.
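As a rough sketch of the "REST API to etcd with access control and validation" idea, the function below checks permissions, validates the object, and only then writes it to a store. Everything here is hypothetical: the `RULES` table, the `create` function, and the exception names are made up, and a plain dict stands in for etcd. The real apiserver does far more than this.

```python
class Forbidden(Exception):
    """Raised when the user is not allowed to perform the action."""


class Invalid(Exception):
    """Raised when the submitted object fails validation."""


# Hypothetical access rules: user -> set of (verb, kind) pairs they may use.
RULES = {
    "alice": {("create", "pods"), ("get", "pods")},
    "bob": {("get", "pods")},
}


def create(store, user, kind, namespace, obj):
    """Toy 'create' handler: access control, then validation, then persist."""
    # 1. Access control: is this user allowed to create this kind of object?
    if ("create", kind) not in RULES.get(user, set()):
        raise Forbidden(f"{user} cannot create {kind}")

    # 2. Validation: the object needs a name, and that name must be unused.
    name = obj.get("name")
    if not name:
        raise Invalid("object needs a name")
    key = f"/registry/{kind}/{namespace}/{name}"
    if key in store:
        raise Invalid(f"{kind} {name} already exists")

    # 3. Persist: only now does anything reach the datastore.
    store[key] = obj
    return key


store = {}  # stand-in for etcd
created_key = create(store, "alice", "pods", "mynamespace", {"name": "api-a"})
```

Nothing gets into the store without passing through the first two steps, which is why every other component talks to the apiserver rather than to etcd directly.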

You'll find references to HA apiserver setups. This really just means there are multiple kube-apiserver pods running, and there's something distributing traffic in front. The thing distributing traffic could very well just be multiple A records pointing to the nodes running the pods, in your DNS zonefile.

kube-controller-manager

The controller-manager runs ‘controllers’. On a very, very high level, you can think of controllers as processes that watch etcd through the apiserver and do things when objects are added, removed, or changed.

Deployment object created? The deployment controller reads the object and creates some pod objects from the template it contains (by way of a ReplicaSet). CronJob object created? The cronjob controller starts a pod defined in the object on the appropriate schedule (by way of a Job).
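Here's a minimal sketch of what one pass of such a controller might look like: compare the desired state with what exists, then create or delete pods to close the gap. The `reconcile` function and the dict layout are made up for illustration, and the objects live in a plain dict instead of being read through the apiserver. A real deployment controller also works through a ReplicaSet rather than touching pods directly; the sketch skips that.

```python
import itertools


def reconcile(objects, name):
    """One pass of a toy deployment controller. Returns the owned pod names."""
    deployment = objects["deployments"][name]
    pods = objects["pods"]
    owned = sorted(p for p, spec in pods.items() if spec["owner"] == name)
    missing = deployment["replicas"] - len(owned)

    if missing > 0:
        # Too few pods: create new ones from the template, using unused names.
        counter = itertools.count()
        while missing > 0:
            candidate = f"{name}-{next(counter)}"
            if candidate not in pods:
                pods[candidate] = {"owner": name, **deployment["template"]}
                missing -= 1
    elif missing < 0:
        # Too many pods: delete the surplus from the end of the sorted list.
        for pod_name in owned[missing:]:
            del pods[pod_name]

    return sorted(p for p, spec in pods.items() if spec["owner"] == name)


objects = {
    "deployments": {"myapi": {"replicas": 2, "template": {"image": "myapi:1.0"}}},
    "pods": {},
}

first = reconcile(objects, "myapi")    # creates two pods

objects["deployments"]["myapi"]["replicas"] = 1
second = reconcile(objects, "myapi")   # scales down to one pod
```

Running the same function again without changing anything is a no-op, which is the property that makes the "watch and react" model workable.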

kube-scheduler

The scheduler handles figuring out where pods should run, given not all hosts are equal. Want your pods to run on a node with an SSD? What about requiring an SSD, and preferring to run on nodes in different datacentres? What about requiring nodes with an SSD, in different datacentres, preferring nodes that don’t already run a pod in that deployment?
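Those questions boil down to two steps: filter out the nodes that can't run the pod, then score the ones that are left. Below is a minimal sketch of that idea. The `schedule` function, the field names, and the score weights are all made up; the real scheduler is configured through things like node affinity and topology constraints rather than hand-written Python.

```python
def schedule(pod, nodes, placed):
    """Pick a node name for pod, or None if no node qualifies.

    placed is a list of (deployment, node_name) pairs for pods already running.
    """
    # Filter: hard requirements. A node without an SSD is simply out.
    feasible = [n for n in nodes if n["ssd"] or not pod["require_ssd"]]
    if not feasible:
        return None

    used_nodes = {node for dep, node in placed if dep == pod["deployment"]}
    used_dcs = {n["dc"] for n in nodes if n["name"] in used_nodes}

    def score(node):
        points = 0
        if node["dc"] not in used_dcs:
            points += 2  # prefer a datacentre this deployment isn't in yet
        if node["name"] not in used_nodes:
            points += 1  # prefer a node not already running this deployment
        return points

    # Highest score wins; ties are broken by node name to stay deterministic.
    best = min(feasible, key=lambda n: (-score(n), n["name"]))
    return best["name"]


nodes = [
    {"name": "node1", "ssd": True, "dc": "east"},
    {"name": "node2", "ssd": True, "dc": "east"},
    {"name": "node3", "ssd": False, "dc": "west"},
    {"name": "node4", "ssd": True, "dc": "west"},
]
pod = {"deployment": "myapi", "require_ssd": True}

choice_1 = schedule(pod, nodes, [])
choice_2 = schedule(pod, nodes, [("myapi", "node1")])
choice_3 = schedule(pod, nodes, [("myapi", "node1"), ("myapi", "node4")])
```

With these inputs the first pod lands on `node1`, the second on `node4` (other datacentre), and the third on `node2` (the only SSD node not yet running the deployment). `node3` is never considered because it fails the hard requirement.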

kube-proxy

Kube-proxy takes more explaining. Here’s a working example that’s entirely confined to routing inside the cluster:

The scheduler puts an API pod (A, 10.99.19.19) on Host 1, and another API pod (B, 10.99.17.3) on Host 2. They’re both behind a service named ‘myapi’ (10.240.0.1). The addresses are arbitrary, and it doesn’t really matter what your API does.

You then run a pod (C, 10.11.3.6) that depends on the API pods. While pods A and B do have addresses, those addresses could change at any moment. Say you update the image tag, or change the configuration in a way that requires a restart. The API pods will be recreated with new addresses.

For that reason you instead direct your application in pod C to seek the API on the service address. Recall we put A and B behind a service named ‘myapi’. Doing this automatically created an A record in coredns named “myapi.mynamespace.svc.cluster.myk8s.tld” (service name, then namespace, then “svc”, then the cluster domain). That A record resolves to 10.240.0.1.

On the backend, when kube-proxy saw that service get created, it enumerated the addresses of the pods referenced by it and generated DNAT iptables rules (presuming you're using the iptables backend). Those rules create a mapping so that traffic sent to 10.240.0.1 is instead sent to 10.99.19.19 or 10.99.17.3.

To keep this short - kube-proxy is responsible for managing iptables rules to properly map services to their pods.
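The mapping in the example can be sketched as a lookup table from service address to pod addresses, plus a function that rewrites the destination. This is a toy model in Python, not what kube-proxy actually runs: `build_dnat_table` and `route` are made-up names, and matching pods by an `app` label is my assumption about how the service references them. Picking a backend at random is a simplification of what the iptables rules do.

```python
import random


def build_dnat_table(services, pods):
    """Map each service address to the addresses of the pods it selects."""
    table = {}
    for svc in services:
        backends = [
            p["ip"]
            for p in pods
            if all(p["labels"].get(k) == v for k, v in svc["selector"].items())
        ]
        table[svc["cluster_ip"]] = sorted(backends)
    return table


def route(table, dst_ip, rng=random):
    """Rewrite the destination if it is a service address, else leave it alone."""
    backends = table.get(dst_ip)
    if not backends:
        return dst_ip
    return rng.choice(backends)


services = [
    {"name": "myapi", "cluster_ip": "10.240.0.1", "selector": {"app": "myapi"}},
]
pods = [
    {"name": "A", "ip": "10.99.19.19", "labels": {"app": "myapi"}},
    {"name": "B", "ip": "10.99.17.3", "labels": {"app": "myapi"}},
    {"name": "C", "ip": "10.11.3.6", "labels": {"app": "client"}},
]

table = build_dnat_table(services, pods)
destination = route(table, "10.240.0.1")  # one of the two API pod addresses
```

When the API pods are recreated with new addresses, only the table needs rebuilding. Pod C keeps talking to 10.240.0.1 and never notices.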

kubelet

Kubelet sits on each host in the cluster and does the talking to the container runtime (Docker, for example). There’s not a lot to say about kubelet. It’s actually the component that I’ve had the fewest issues with. Even during upgrades.

The Bad

There’s a lot of it. A big argument for Nomad and K3s is that they are a single binary. Personally, there are certainly times I look at pods running in the kube-system namespace and think - “God, this list is huge”. In those instances, I yearn for the single-process-per-host model. Is it running and saying it’s fine? Great. Is it not running or saying something is wrong? Well, at least I know where to start looking.

In contrast, Kubernetes can break in as many ways as there are combinations of components. There are more ways for Kubernetes to screw me over than there are stars in the sky. And yes, sure, splitting things out makes sense in a lot of scenarios. But for a small cluster they’re largely not applicable.

Setting up a new cluster is trivial with kubeadm. But it places very few requirements on the sysadmin, and leaves you with something almost black-box in nature. There’s no way to improve the situation on this front. Except maybe requiring that kubeadm init ask you a series of exam questions. I’m being glib, I know.

The logs for all these components aren’t unified. Relating something in kubelet’s logs to something in the scheduler logs isn’t possible without effort. In the case of in-cluster components, it’s slightly better, since you can at least get at them the same way. But even then, they’re not in the same format. CNI (which I’ll cover in another post) compounds this issue.