A few articles advocating Nomad have popped up lately. This isn’t an argument against using Nomad over Kubernetes. Rather, this is the first of a series of posts I want to make about Kubernetes from the perspective of a system administrator. They’re all going to be written on the assumption that the reader has at least dabbled with Kubernetes. I don’t have it in me to go over what a ‘pod’ is.
This specific post is about the components that make up a Kubernetes cluster, and some of the badness that comes out of its current architecture. There’s plenty to complain about, so I won’t be talking about CNI or more complex routing concepts (like properly handling asymmetric routing) quite yet.
Kubernetes has a broad architecture. There are a lot of individual processes involved, and you need to care about the health of each and every one.
To be clear - I’m not advocating understanding how every single component works here. But I am advocating that if you want to use Kubernetes, you at least know what they do, because when they go tits-up, you want to know at least two things:
etcd is the datastore. This is where every bit of information is kept. You may have seen it elsewhere. It’s a well-tested, solid key-value store built around Raft consensus.
coredns is the DNS server. Something necessary if you want containers to be able to reference the address of other containers when those addresses are ephemeral. Although not as commonly seen outside of Kubernetes as etcd, it’s also an off-the-shelf solution I’m happy to see used.
The API server is your interface to the cluster, and also the cluster’s interface to the cluster. A nice way to consider it is as a REST API to etcd that also applies access control and object validation.
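To make the "REST API to etcd with access control and validation" framing concrete, here is a toy sketch. Everything in it is an assumption for illustration: a plain dict stands in for etcd, and the permission table, required fields, and key layout are invented rather than taken from Kubernetes.

```python
# Toy model only: a dict stands in for etcd, and the rules below are invented.
store = {}

# Hypothetical access rules: which user may perform which verbs.
PERMISSIONS = {"admin": {"create", "get"}, "viewer": {"get"}}


def authorize(user, verb):
    """Refuse any verb the user has not been granted."""
    if verb not in PERMISSIONS.get(user, set()):
        raise PermissionError(f"{user} may not {verb}")


def validate(obj):
    """Reject objects missing the fields this toy model requires."""
    for field in ("kind", "namespace", "name"):
        if not obj.get(field):
            raise ValueError(f"missing required field: {field}")


def create(user, obj):
    """Access control first, then validation, and only then a write."""
    authorize(user, "create")
    validate(obj)
    key = f"/{obj['kind']}/{obj['namespace']}/{obj['name']}"
    store[key] = obj
    return key


def get(user, key):
    """Reads go through the same access check as writes."""
    authorize(user, "get")
    return store[key]
```

The point of the shape is that nothing touches the store without passing both checks, which is why every other component talks to the API server rather than to etcd directly.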
The controller-manager runs ‘controllers’. On a very, very high level, you can think of controllers as processes that watch etcd through the apiserver and do things when objects are added, removed, or changed.
Deployment object created? The deployment controller reads the object and creates some pod objects from the template it contains. Cronjob object created? The cronjob controller creates a job from the template in the object on the appropriate schedule, and that job in turn runs a pod.
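The "watch, compare, act" idea can be sketched as a single reconcile function. This is a simplification under stated assumptions: the dict keys (`name`, `replicas`, `template`, `owner`, `spec`) are hypothetical, and the real deployment controller works through an intermediate ReplicaSet object that this sketch skips.

```python
# Toy model: plain dicts stand in for objects held by the API server.
def reconcile(deployment, pods):
    """Return the pod list after bringing it in line with the deployment."""
    owner = deployment["name"]
    owned = [p for p in pods if p["owner"] == owner]
    others = [p for p in pods if p["owner"] != owner]
    wanted = deployment["replicas"]

    # Too many pods: drop the surplus.
    owned = owned[:wanted]

    # Too few pods: create new ones from the template.
    used_names = {p["name"] for p in owned}
    index = 0
    while len(owned) < wanted:
        name = f"{owner}-{index}"
        index += 1
        if name in used_names:
            continue
        owned.append(
            {"name": name, "owner": owner, "spec": dict(deployment["template"])}
        )
        used_names.add(name)

    return others + owned
```

Running it twice with the same input changes nothing the second time. That property is what lets a controller be restarted, or simply re-run on every change, without doing damage.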
The scheduler handles figuring out where pods should run, given not all hosts are equal. Want your pods to run on a node with an SSD? What about requiring an SSD, and preferring to run on nodes in different datacentres? What about requiring nodes with an SSD, in different datacentres, preferring nodes that don’t already run a pod in that deployment?
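The difference between "requiring" and "preferring" can be sketched as a filter step followed by a scoring step. This is a minimal illustration, not the scheduler's actual algorithm: the node fields (`name`, `ssd`, `datacentre`) and the point values are assumptions made up for the example.

```python
def schedule(nodes, running, deployment):
    """Pick a node name for one new pod of `deployment`, or None if none fit.

    nodes: list of dicts with hypothetical "name", "ssd", "datacentre" keys.
    running: list of (node_name, deployment_name) pairs for existing pods.
    """
    # Hard requirement: nodes without an SSD are filtered out entirely.
    candidates = [n for n in nodes if n["ssd"]]
    if not candidates:
        return None

    by_name = {n["name"]: n for n in nodes}
    busy_nodes = {node for node, dep in running if dep == deployment}
    busy_dcs = {by_name[node]["datacentre"] for node in busy_nodes}

    def score(node):
        # Soft preferences only add points; they never exclude a node.
        points = 0
        if node["datacentre"] not in busy_dcs:
            points += 2  # prefer a datacentre this deployment is not in yet
        if node["name"] not in busy_nodes:
            points += 1  # prefer a node not already running this deployment
        return points

    return max(candidates, key=score)["name"]
```

A requirement that cannot be met leaves the pod unscheduled, whereas a preference that cannot be met just means a lower-scoring node wins.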
Kube-proxy takes more explaining. Here’s a working example that’s entirely contained to routing inside the cluster:
The scheduler puts an API pod (A, 10.99.19.19) on Host 1, and another API pod (B, 10.99.17.3) on Host 2. They’re both behind a service named ‘myapi’ (10.240.0.1). The addresses are arbitrary, and it doesn’t really matter what your API does.
You then run a pod (C, 10.11.3.6) that depends on the API pods. While pods A and B do have addresses, those addresses could change at any moment. Say you update the image tag, or change the configuration in a way that requires a restart. The API pods will be recreated with new addresses.
For that reason you instead direct your application in pod C to seek the API on the service address. Recall we put A and B behind a service named ‘myapi’. Doing this created an A record in coredns named “myapi.mynamespace.svc.cluster.myk8s.tld”. That A record resolves to 10.240.0.1.
On the backend, when kube-proxy saw that service get created, it enumerated the addresses of the pods referenced by it and generated DNAT iptables rules. Those rules map traffic sent to 10.240.0.1 so that it is instead sent to 10.99.19.19 or 10.99.17.3.
To keep this short - kube-proxy is responsible for managing iptables to properly map services to their pods.
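As a rough illustration of that mapping, the sketch below builds iptables commands for the example addresses. It is heavily simplified: kube-proxy maintains its own set of chains rather than appending to `PREROUTING`, and the TCP port 80 is an assumption that isn't part of the example above. The `statistic` match and `DNAT` target are standard iptables features.

```python
def dnat_rules(service_ip, pod_ips, port=80):
    """Build simplified iptables commands spreading service traffic over pods.

    Rules are evaluated in order, so with `remaining` pods left each rule
    matches with probability 1/remaining, and the last rule catches whatever
    is left. That gives every pod an equal share overall.
    """
    rules = []
    remaining = len(pod_ips)
    for pod_ip in pod_ips:
        match = f"-d {service_ip}/32 -p tcp --dport {port}"
        if remaining > 1:
            probability = f"{1 / remaining:.5f}"
            match += f" -m statistic --mode random --probability {probability}"
        rules.append(
            f"iptables -t nat -A PREROUTING {match} "
            f"-j DNAT --to-destination {pod_ip}:{port}"
        )
        remaining -= 1
    return rules


# The 'myapi' service and its two pods from the example.
for rule in dnat_rules("10.240.0.1", ["10.99.19.19", "10.99.17.3"]):
    print(rule)
```

When a pod is recreated with a new address, the rules have to be regenerated, which is why kube-proxy needs to keep watching the API server rather than running once.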
Kubelet sits on each host in the cluster and does the talking to the container runtime (docker, for example). There’s not a lot to say about kubelet. It’s actually the component that I’ve had the fewest issues with. Even during upgrades.
There’s a lot of it. A big argument for Nomad and K3s is that they are a single binary. Personally, there are certainly times I look at pods running in the kube-system namespace and think - “God, this list is huge”. In those instances, I yearn for the single-process-per-host model. Is it running and saying it’s fine? Great. Is it not running or saying something is wrong? Well, at least I know what I can expect to be broken.
In contrast, Kubernetes can break in as many ways as there are combinations of components. There are more ways for Kubernetes to screw me over than there are stars in the sky. And yes, sure, splitting things out makes sense in a lot of scenarios. But for a small cluster they’re largely not applicable.
Setting up a new cluster is trivial with kubeadm. But it places very few requirements on the sysadmin, and leaves you with something almost black-box in nature. There’s no way to improve the situation on this front. Except maybe requiring kubeadm init ask you a series of exam questions. I’m being glib, I know.
The logs for all these components aren’t unified. Relating something in kubelet’s logs to something in the scheduler logs isn’t possible without effort. In the case of in-cluster components, it’s slightly better. But in this instance, they’re not even the same format. CNI (which I’ll cover in another post) compounds this issue.
I’m mostly done with this post, but here are some rambling thoughts about how things could maybe be improved. At least for small clusters, if you take a step back and look at the core concerns, there are a few instances where you could see two of them merged. Here are three examples I can think of.
You could possibly get rid of coredns and instead have kubelet (on each node) provide that function. It would be one less component to worry about, and failure of DNS would be confined to one node instead of the entire cluster. But there are some concerns with this. First, we’d be increasing the communication between kube-apiserver and kubelet. While that may be a good trade for small clusters, a centralised DNS service potentially allows for consistent responses and distribution of traffic. Finally, perhaps more importantly, it would be a move from an off-the-shelf solution to something Kubernetes-specific, which I think is generally undesirable.
You could maybe roll kube-proxy into kubelet. I’m not certain I want routing and communication with the container runtime handled in the same process. But it would be a more intuitive arrangement having the process running on every node (kubelet) handling the iptables rules for its local host. With that said, the major reason not to do this is that iptables isn’t the only backend for kube-proxy - it also supports IPVS, making for a non-trivial surface area of code.
As a final example, since this isn’t an article on what components we could imagine squashed together - you could potentially have the scheduler rolled into the controller manager. I suppose it’s like this because the controller manager manages kubernetes objects, but fundamentally, the scheduler is managing host resources? Or maybe because you can go out and write your own scheduler. My personal take is that you should think thrice before doing so, but I’ve been in situations where it was a viable solution.
In no particular order: