Analyzing SimKube 1.0: How well does it work?
Ok, this post is the last part of my 3-part series on SimKube 1.0. Over the last couple of weeks, we walked through some background about SimKube, the Kubernetes Cluster Autoscaler (KCA) and Karpenter, and then performed an in-depth analysis of the two autoscalers based on simulation data. Today, we’re going to take a look at SimKube itself: its performance, some lessons that I learned from this set of experiments, and next steps in the Kubernetes simulation world. If you want to go back and read the previous posts from this series first (recommended), you can find them here:
- Part 1 - Background and Motivation
- Part 2 - Running the Simulations
- Part 3 (this post) - Analysis and Follow-up
As before, I’ll note that all of the raw data for these experiments is publicly available for download here if you’re interested in trying it out on your own!
Who watches the Simulator?
One of the selling points of SimKube is that you can (supposedly) run simulations of multi-thousand-node Kubernetes clusters on your laptop. But all of the experiments from last week ran on an AWS c6i.8xlarge instance with 32 vCPUs and 64GB of RAM! I dunno about you, but my laptop doesn’t have 32 vCPUs and 64GB of RAM, so what gives?
As I hinted at briefly last week, it all comes down to metrics. SimKube itself actually ran fine, even with my largest simulations. I’m not going to include the graphs here because they’re not that interesting, but both the simulation controller and the simulation driver pods used a tiny fraction of the host’s available CPU and about 20MB of memory. The KWOK controller1 similarly used around half a core and 60-100MB of RAM; we had a maximum of about 100 nodes in the Karpenter experiment, and more like 1000 nodes in the CA experiment2. If we assume that this resource consumption scales linearly3, then you’re looking at maybe half a gig of memory for a 5000-node cluster, which is the largest size supported by “official” Kubernetes. (Again, I don’t think these graphs are particularly interesting, but you can see them in the Jupyter notebooks for the experiments.)
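To make that back-of-the-envelope number concrete, here’s a quick sketch of the linear extrapolation. The baseline figures (roughly 100MB of RAM at around 1000 nodes, from the CA experiment) are rough readings, not precise measurements, so treat the output as an order-of-magnitude estimate:

```python
# Rough linear extrapolation of KWOK controller memory usage.
# Baseline: ~100MB of RAM at ~1000 nodes (approximate figures from the
# CA experiment; these are eyeballed from graphs, not exact constants).
BASELINE_NODES = 1000
BASELINE_MEM_MB = 100


def extrapolate_memory_mb(nodes: int) -> float:
    """Estimate memory usage, assuming it scales linearly with node count."""
    return BASELINE_MEM_MB * nodes / BASELINE_NODES


# 5000 nodes is the largest "officially" supported Kubernetes cluster size.
print(extrapolate_memory_mb(5000))  # 500.0 -> about half a gig
```

Of course, linear scaling is itself an assumption; a real capacity plan would want measurements at a few cluster sizes to check it.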
Just for funsies, we can also take a look at the resource utilization of the Kubernetes control plane (apiserver, controller-manager, and scheduler). We’re not taxing Kubernetes itself particularly hard with this experiment, but this can give you an idea of the control plane’s resource utilization for a middling-sized cluster: