A few articles advocating Nomad have popped up lately. This isn't an argument against using Nomad over Kubernetes. Rather, this is the first of a series of posts I want to make about Kubernetes from the perspective of a system administrator. They're all going to be written from the angle that the reader has at least dabbled with Kubernetes. I don't have it in me to go over what a ‘pod’ is.
This specific post is about the components that make up a Kubernetes cluster, and some of the badness that comes out of its current architecture. There's plenty to complain about, so I won't be talking about CNI or more complex routing concepts (like properly handling asymmetric routing) quite yet.
Kubernetes has a broad architecture. There are a lot of individual processes involved, and you need to care about the health of each and every one.
To be clear - I'm not advocating understanding how every single component works here. But I am advocating that if you want to use Kubernetes, you at least know what they do, because when they go tits-up, you want to know at least two things:
etcd is the datastore. This is where cluster information is kept. You may have seen it elsewhere. It's a well-tested, solid key-value store built around Raft consensus.
CoreDNS is the DNS server. It's necessary if you want containers to be able to reference the address of other containers when those addresses are ephemeral. Although not as commonly seen outside of Kubernetes as etcd, it's also an off-the-shelf solution I'm happy to see used.
The API server is your interface to the cluster, and also the cluster's interface to the cluster. A nice way to consider it is as a REST API to etcd that also applies access control and object validation.
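As a rough illustration of that "REST API to etcd, plus access control and validation" idea, here is a toy sketch. Everything in it is an assumption made for the example: the `ToyApiServer` class, the permission tuples, and the "object must have a spec" rule are hypothetical, and a plain dict stands in for etcd. The real API server does far more than this.

```python
class ToyApiServer:
    """Toy model only: a dict stands in for etcd."""

    def __init__(self, permissions):
        self.store = {}                 # stand-in for etcd
        self.permissions = permissions  # {user: {(verb, kind), ...}}

    def _authorize(self, user, verb, kind):
        # Access control: is this user allowed this verb on this kind?
        if (verb, kind) not in self.permissions.get(user, set()):
            raise PermissionError(f"{user} may not {verb} {kind}")

    def create(self, user, kind, name, obj):
        self._authorize(user, "create", kind)
        # Object validation (hypothetical rule): every object needs a spec.
        if not isinstance(obj.get("spec"), dict):
            raise ValueError("object must contain a 'spec' mapping")
        self.store[(kind, name)] = obj

    def get(self, user, kind, name):
        self._authorize(user, "get", kind)
        return self.store[(kind, name)]


if __name__ == "__main__":
    api = ToyApiServer({"alice": {("create", "Pod"), ("get", "Pod")}})
    api.create("alice", "Pod", "c", {"spec": {"image": "myapp"}})
    print(api.get("alice", "Pod", "c"))
```

The point is only that nothing reaches the datastore without passing through the same two gates, which is why every other component talks to the API server rather than to etcd directly.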
The controller-manager runs ‘controllers’. On a very, very high level, you can think of controllers as processes that watch etcd through the apiserver and do things when objects are added, removed, or changed.
Deployment object created? The deployment controller reads the object and creates some pod objects from the template it contains. CronJob object created? The CronJob controller starts a pod defined in that object on the appropriate schedule.
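The "watch objects, then act" pattern can be sketched as a reconcile loop. This is a minimal sketch under loud assumptions: `FakeApiServer`, `Deployment`, `Pod`, and `reconcile` are hypothetical names, the state lives in memory, and the function is called by hand instead of being driven by a watch. The real deployment controller also works through ReplicaSets rather than creating pods directly.

```python
from dataclasses import dataclass, field


@dataclass
class Deployment:
    name: str
    replicas: int
    image: str  # stands in for the full pod template


@dataclass
class Pod:
    name: str
    owner: str
    image: str


@dataclass
class FakeApiServer:
    """In-memory stand-in for the apiserver; not a real client."""
    deployments: dict = field(default_factory=dict)
    pods: dict = field(default_factory=dict)
    next_id: int = 0


def reconcile(api: FakeApiServer) -> None:
    """Make the observed pods match what each deployment asks for."""
    for dep in api.deployments.values():
        owned = [p for p in api.pods.values() if p.owner == dep.name]
        # Too few pods: create the missing ones from the template.
        for _ in range(dep.replicas - len(owned)):
            api.next_id += 1
            name = f"{dep.name}-{api.next_id}"
            api.pods[name] = Pod(name, dep.name, dep.image)
        # Too many pods: remove the surplus.
        for pod in owned[dep.replicas:]:
            del api.pods[pod.name]
    # Pods whose deployment has been deleted get cleaned up.
    for pod in list(api.pods.values()):
        if pod.owner not in api.deployments:
            del api.pods[pod.name]


if __name__ == "__main__":
    api = FakeApiServer()
    api.deployments["web"] = Deployment("web", replicas=3, image="nginx")
    reconcile(api)
    print(sorted(api.pods))
```

Running `reconcile` again with nothing changed does nothing, which is the property that makes this style of controller safe to restart.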
The scheduler handles figuring out where pods should run, given not all hosts are equal. Want your pods to run on a node with an SSD? What about requiring an SSD, and preferring to run on nodes in different datacentres? What about requiring nodes with an SSD, in different datacentres, preferring nodes that don't already run a pod in that deployment?
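The difference between "requiring" and "preferring" can be sketched as a filter step followed by a scoring step. This is a simplified sketch, not the scheduler's actual algorithm: the `Node` class, the `schedule` function, the `disk` and `datacentre` label names, and the score weights are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    labels: dict
    pods: list  # names of deployments that already have a pod here


def schedule(nodes, deployment, required, spread_label):
    """Pick a node: filter on hard requirements, then score preferences."""
    # Filter: drop every node that fails a required label.
    feasible = [n for n in nodes
                if all(n.labels.get(k) == v for k, v in required.items())]
    if not feasible:
        return None  # nothing fits, so the pod would stay unscheduled

    # Datacentres that already host a pod from this deployment.
    used = {n.labels.get(spread_label) for n in nodes if deployment in n.pods}

    def score(node):
        points = 0
        if deployment not in node.pods:
            points += 1  # prefer nodes without a pod from this deployment
        if node.labels.get(spread_label) not in used:
            points += 2  # prefer a datacentre we are not in yet
        return points

    return max(feasible, key=score)


if __name__ == "__main__":
    nodes = [
        Node("n1", {"disk": "ssd", "datacentre": "a"}, ["web"]),
        Node("n2", {"disk": "ssd", "datacentre": "a"}, []),
        Node("n3", {"disk": "ssd", "datacentre": "b"}, []),
        Node("n4", {"disk": "hdd", "datacentre": "c"}, []),
    ]
    chosen = schedule(nodes, "web", {"disk": "ssd"}, "datacentre")
    print(chosen.name)
```

A required rule that nothing satisfies leaves the pod with nowhere to go, while an unsatisfied preference only lowers a score. That distinction is worth keeping in mind when a pod refuses to schedule.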
How that preference is specified can vary, as the distribution of pods to nodes is a topic that has changed a few times. Pod Topology Spread Constraints is a relatively new feature that handles distribution on the basis of a node label. It bypasses a lot of the wrangling that previously would have been required through affinity rules and tolerations. I won't cover any of that here.
Kube-proxy takes more explaining. Here's a working example that's entirely contained to routing inside the cluster:
The scheduler puts an API pod (A, 10.99.19.19) on Host 1, and another API pod (B, 10.99.17.3) on Host 2. They're both behind a service named ‘myapi’ (10.240.0.1). The addresses are arbitrary, and it doesn't really matter what your API does.
You then run a pod (C, 10.11.3.6) that depends on the API pods. While pods A and B do have addresses, those addresses could change at any moment. Say you update the image tag, or change the configuration in a way that requires a restart. The API pods will be recreated with new addresses.
For that reason you instead direct your application in pod C to seek the API on the service address. Recall we put A and B behind a service named ‘myapi’. Doing this created an A record in CoreDNS named “myapi.mynamespace.svc.cluster.myk8s.tld”. That A record resolves to 10.240.0.1.
On the backend, when kube-proxy saw that service get created, it enumerated the addresses of the pods referenced by it and generated DNAT iptables rules. Those rules map traffic sent to 10.240.0.1 so that it is instead sent to 10.99.19.19 or 10.99.17.3.
To keep this short - kube-proxy is responsible for managing iptables to properly map services to their pods.
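To make that mapping concrete, here is a sketch that turns a service address and its pod addresses into simplified DNAT rules. It is an approximation under stated assumptions: the `dnat_rules` function, the `EXAMPLE-SERVICES` chain name, and port 8080 are hypothetical, and real kube-proxy spreads its rules across several chains of its own. The `statistic` match with a random probability is one way to split traffic between the backends.

```python
def dnat_rules(service_ip, port, endpoints):
    """Build simplified iptables DNAT rules for one service.

    Rule i matches with probability 1 / (endpoints left), so each
    endpoint ends up with an equal share of new connections.
    """
    rules = []
    total = len(endpoints)
    for i, endpoint in enumerate(endpoints):
        rule = f"-A EXAMPLE-SERVICES -d {service_ip}/32 -p tcp --dport {port}"
        remaining = total - i
        if remaining > 1:
            # Not the last endpoint: only take a share of the traffic.
            rule += (" -m statistic --mode random"
                     f" --probability {1 / remaining:.5f}")
        # The last endpoint catches whatever the earlier rules let through.
        rule += f" -j DNAT --to-destination {endpoint}:{port}"
        rules.append(rule)
    return rules


if __name__ == "__main__":
    # Addresses from the example above; the port is an assumption.
    for line in dnat_rules("10.240.0.1", 8080, ["10.99.19.19", "10.99.17.3"]):
        print(line)
```

When pods A and B are recreated with new addresses, the rules have to be regenerated from the new endpoint list. Pod C never notices, because it only ever talks to 10.240.0.1.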
Kubelet sits on each host in the cluster and does the talking to the container runtime (Docker, for example). There's not a lot to say about kubelet. It's actually the component that I've had the fewest issues with. Even during upgrades.
There's a lot of it. A big argument for Nomad and K3s is that they are a single binary. Personally, there are certainly times I look at pods running in the kube-system namespace and think - “God, this list is huge”. In those instances, I yearn for the single-process-per-host model.
In contrast, Kubernetes can break in as many ways as there are combinations of components. There are more ways for Kubernetes to screw me over than there are stars in the sky. And yes, sure, splitting things out makes sense in a lot of scenarios. But for a small cluster they're largely not applicable.
Setting up a new cluster is trivial with kubeadm. But it places very few requirements on the sysadmin, and leaves you with something almost black-box in nature. There's no way to improve the situation on this front. Except maybe requiring that kubeadm init ask you a series of exam questions. I'm being glib, I know. But the complexity of Kubernetes can't be side-stepped, and tools like kubeadm that provide a quick-start solution migrate that complexity onto your internal documentation.
The logs for all these components aren't unified. Relating something in kubelet's logs to something in the scheduler logs isn't possible without effort. In the case of in-cluster components, it's slightly better. But even then, they're not all in the same format. CNI (which I'll cover in another post) compounds this issue.
In no particular order: