[{"content":" 🔗 This post is part of the \u0026ldquo;Kubernetes on Raspberry\u0026rdquo; series. To get the full context, it\u0026rsquo;s recommended to start from the beginning.\nHere is an overview of the entire series.\n Building a Home Lab with Raspberry Pis and k8s Released: June 2024\n   ️Hardware and Cluster Node configuration Released: August 2024\n    Bootstrapping the nodes      Installing k8s  TBA: October 2024\n    Monitoring  TBA: November 2024\n    Storage  TBA: EOY 2024\n    Ingress \u0026amp; exposing things to the world  TBA: 2025\n    Security  TBA: 2025\n    Running Databases  TBA: 2025\n    Running complex applications  TBA: 2025\n    BONUS: Enclosure \u0026amp; 3D Printing  TBA: Unknown\n        document.addEventListener(\"DOMContentLoaded\", function() { const iconMap = { \"past\": \"fa-solid fa-circle-check\", \"current\": \"fa-solid fa-circle-play\", \"upcoming\": \"fa-solid fa-spinner\", }; const defaultIcon = \"fa-solid fa-question-circle\"; const icons = document.querySelectorAll('.timeline-icon'); icons.forEach(function(iconDiv) { iconDiv.innerHTML = ''; let iconAdded = false; const classes = iconDiv.classList; classes.forEach(function(cls) { if (iconMap[cls]) { const iconElement = document.createElement('i'); iconElement.className = iconMap[cls]; iconDiv.appendChild(iconElement); iconAdded = true; } }); if (!iconAdded) { const defaultIconElement = document.createElement('i'); defaultIconElement.className = defaultIcon; iconDiv.appendChild(defaultIconElement); } }); });  Intro Hi there, nice to see you 👋!\nLast time we discussed about the hardware needed for building the cluster, its architecture and performed a rough cost estimation.\nI assume that since you\u0026rsquo;re still reading, you\u0026rsquo;re ready to start building the cluster. So let\u0026rsquo;s actually start doing that!\nThis post is going to be more practical. 
I\u0026rsquo;ll explain how to bootstrap the Raspberry Pis as Kubernetes nodes and cover some utilities to help you with that.\nLet\u0026rsquo;s dive in 🤿!\nKubernetes Specific Node Requirements There are quite a few requirements for turning a Linux machine into a Kubernetes node. The relevant documentation is pretty nice (see here).\nA lot of these are usually met by a Linux system by default, so we\u0026rsquo;ll focus on the ones that need to be explicitly taken care of by us.\nOS choice We have a ton of options for operating systems. However, to avoid complicating things, I tried to find Linux distributions that already provide support for the Raspberry Pi 5.\nThe most obvious option would be Raspberry Pi OS. It\u0026rsquo;s the official Raspberry Pi operating system, which means it supports the Pi\u0026rsquo;s hardware by default.\nHowever, I\u0026rsquo;m already quite familiar with Ubuntu, having used it for quite a few years, and since there\u0026rsquo;s a recent LTS version officially certified for the Raspberry Pi 5, I leaned towards that.\n Ubuntu 24.04 LTS is certified for Raspberry Pi 5\n  The only issue I encountered while evaluating it was the absence of the iscsi_tcp module. It\u0026rsquo;s not installed by default, but that\u0026rsquo;s not hard to address, as seen below.\n Missing iscsi_tcp kernel module in Ubuntu 21.10 for Raspberry Pi ARM64     So finally, I opted for Ubuntu Server LTS 24.04 for Raspberry Pi (download link, preinstalled image).\nPackages / configuration We\u0026rsquo;ll disable swap, since Kubernetes requires it to be off by default. 
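On the preinstalled Ubuntu Server image, disabling swap (if any is configured at all) boils down to turning it off and keeping it off across reboots. A minimal sketch, assuming any swap entries live in /etc/fstab:

```shell
# Turn off all active swap immediately
sudo swapoff -a

# Remove any swap entries from /etc/fstab so swap stays off after reboot
sudo sed -i '/\sswap\s/d' /etc/fstab

# Verify: the Swap line should show 0
free -h
```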
There are also a few packages that\u0026rsquo;ll have to be installed, mainly Longhorn dependencies:\n python3-pip git apt-transport-https curl avahi-daemon nfs-common linux-modules-extra-raspi  Now, while some of these aren\u0026rsquo;t explicitly required for Kubernetes (like git), they make node management easier.\nNode Access \u0026amp; Networking To safeguard the nodes, I decided to create 2 different accounts:\n one will allow password login, since I might lose SSH access for some reason and I\u0026rsquo;d like to be able to plug a screen, mouse and keyboard into each node and get access. SSH logins as this account won\u0026rsquo;t be allowed, to prevent brute-force attacks. another account with SSH enabled, which will authenticate using a public key. This one won\u0026rsquo;t have a password set.  I also ensured the nodes have consecutive static IP addresses (e.g. 192.168.2.1, 192.168.2.2 etc), predictable hostnames (e.g. rpi01-ubuntupi) and installed Avahi mDNS to access them via their .local DNS alias. While these aren\u0026rsquo;t Kubernetes requirements (except for the static IPs), they simplify administration a lot.\nPartitions Having separate partitions isn\u0026rsquo;t strictly required. However, it\u0026rsquo;s good practice, especially since these nodes will provide persistent storage. If persistent storage grows uncontrollably, it might cause disk pressure on the node and make it misbehave (e.g. Kubernetes could start killing pods because of it).\nBy creating an additional partition and using only that for persistent storage, we can avoid this scenario.\nThere are a few other cases where things like that could happen, e.g. 
from log files growing uncontrollably, but these are rarer and I didn\u0026rsquo;t want to create a partition for that as well.\nThe preinstalled Ubuntu Server image we\u0026rsquo;ll be using already creates two partitions (one mounted at /boot/firmware and another at /), so we\u0026rsquo;ll actually create a third one. Our disk will look something like this:\n$ lsblk /dev/nvme0n1\nNAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS\nnvme0n1 259:0 0 465.8G 0 disk\n├─nvme0n1p1 259:1 0 512M 0 part /boot/firmware\n├─nvme0n1p2 259:2 0 125G 0 part /\n└─nvme0n1p3 259:3 0 340.3G 0 part /mnt/data\nBooting from NVMe disks If you Google this, you\u0026rsquo;ll see a ton of guides explaining how to make the Pi 5 boot from NVMe disks. However, in my case, it booted by default. It looks like my Raspberry Pis had this option enabled from the start; most likely they\u0026rsquo;re being shipped with newer firmware. If, however, yours don\u0026rsquo;t boot from NVMe by default, here\u0026rsquo;s the official guide on configuring NVMe boot from the Raspberry Pi Documentation.\n Raspberry Pi                 NVMe SSD Boot     NVMe (Non-Volatile Memory express) is a standard for external storage access over a PCIe bus. You can connect NVMe drives via the PCIe slot on a Compute Module 4 (CM4) IO board or Raspberry Pi 5. With some additional configuration, you can boot from an NVMe drive.    
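For reference, the boot order is stored in the bootloader EEPROM. Per the official guide it can be inspected and changed roughly like this (a sketch; double-check the exact values against the current Raspberry Pi documentation before writing them):

```shell
# Open the bootloader configuration in an editor (interactive)
sudo rpi-eeprom-config --edit

# and make sure it contains a boot order that tries NVMe (6) first,
# e.g. falling back to SD card (1), then USB (4), then restart (f):
#
#   BOOT_ORDER=0xf416
```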
cmdline.txt and config.txt These two are configuration files we need to modify, each for different reasons.\nAbout cmdline.txt  Raspberry Pi: What is cmdline.txt and how to use it?     The cmdline.txt file is a configuration file, located in the boot partition of the SD card on Raspberry Pi, and used to pass additional parameters to the Linux Kernel for the system boot.    Here, what\u0026rsquo;s important is:\n net.ifnames=0: Disables the predictable network interface names feature, which assigns names like enp0s3 instead of the traditional eth0. Setting this option to 0 ensures that network interfaces are named using the old style (eth0, wlan0, etc.). While technically this is not required, I opted for it because I\u0026rsquo;m used to this naming convention. root=LABEL=writable: This specifies the root filesystem the kernel should mount during boot. In this case, it refers to a partition labeled \u0026ldquo;writable\u0026rdquo;. This label exists by default on the 2nd partition of the preinstalled Ubuntu image we\u0026rsquo;re using, and it\u0026rsquo;s easier than explicitly setting partition UUIDs here.  
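You can sanity-check both of these on a booted node; a quick look could be:

```shell
# Show partition labels; the root partition of the preinstalled
# Ubuntu image should carry the "writable" label
lsblk -o NAME,LABEL,MOUNTPOINTS /dev/nvme0n1

# Inspect the kernel command line the Pi actually booted with
cat /proc/cmdline
```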
Additionally, quoting from the Kubernetes Documentation:\n Kubernetes \u0026amp; cgroups     On Linux, control groups constrain resources that are allocated to processes. The kubelet and the underlying container runtime need to interface with cgroups to enforce resource management for pods and containers which includes cpu/memory requests and limits for containerized workloads.    So we\u0026rsquo;ll also need to enable:\n cgroup_enable=cpuset: Enables the cpuset controller within cgroups. The cpuset controller allows assigning specific CPUs to specific containers. Kubernetes uses this feature to ensure that containers can be allocated to specific CPUs, allowing for better control over CPU resource allocation and scheduling. cgroup_enable=memory: Enables the memory controller within cgroups. The memory controller is responsible for tracking and limiting the memory usage of processes. Kubernetes uses this to ensure that each container stays within its defined memory limits and doesn\u0026rsquo;t consume more memory than allocated. cgroup_memory=1: This explicitly enables memory accounting in the cgroup memory controller, ensuring that the memory usage of all processes and containers is accurately tracked and managed. Memory accounting is crucial for Kubernetes to enforce memory limits and handle memory-related events (like OOM).  
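Since cmdline.txt is a single line of space-separated parameters, the cgroup flags can simply be appended in place. A minimal sketch (path as on the preinstalled Ubuntu image; keep a backup before touching boot configuration):

```shell
CMDLINE=/boot/firmware/cmdline.txt

# Back up the original kernel command line first
sudo cp "$CMDLINE" "$CMDLINE.bak"

# Append the cgroup parameters to the single kernel command line
sudo sed -i '1 s/$/ cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1/' "$CMDLINE"
```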
About config.txt  Raspberry Pi                 What is config.txt     Instead of the BIOS found on a conventional PC, Raspberry Pi devices use a configuration file called config.txt. The GPU reads config.txt before the Arm CPU and Linux initialise. Raspberry Pi OS looks for this file in the boot partition, located at /boot/firmware/.    Here, we only need to focus on the PCI Express bus related settings. We want to enable the bus and also set it to Gen 3.0 speeds.\n Raspberry Pi                 Raspberry Pi Connector for PCIe     By default, the PCIe connector is not enabled unless connected to a HAT+ device. To enable the connector, add the following line to /boot/firmware/config.txt:\ndtparam=pciex1\nThe connection is certified for Gen 2.0 speeds (5 GT/sec), but you can force Gen 3.0 (10 GT/sec) speeds. To enable PCIe Gen 3.0 speeds, add the following line to /boot/firmware/config.txt:\ndtparam=pciex1_gen=3\n   So, we\u0026rsquo;ll do just that.\nBootstrapping Process We have various ways of doing all the above. 
The most straightforward one would be:\n using a USB NVMe enclosure to plug in the NVMe drive formatting it / copying the image with the Raspberry Pi Imager readjusting partitions (shrinking the 2nd one, creating the 3rd) booting and configuring the node with all the above  However, that\u0026rsquo;s quite cumbersome. Things can slip, and if you\u0026rsquo;re configuring multiple drives it\u0026rsquo;s a bit of a mess.\nTo streamline this, I\u0026rsquo;m happy to introduce you to my bootstrapping utility!\n A set of scripts that help you bootstrap a Raspberry Pi 5 as a k8s node\n This is a set of interactive scripts (two in particular) that handle:\n copying the image to the NVMe drive creating the partition configuration we need copying cloud-init configuration to configure access and networking, install required packages etc modifying config.txt / cmdline.txt as specified above.  With these scripts the process is quite simple:\n plug in the NVMe drive run each script with the required parameters the disk is ready, eject and boot the Pi!  
Let\u0026rsquo;s look into the scripts themselves.\n Wait, what\u0026rsquo;s cloud-init?\nFrom its docs:\n Cloud-init is the industry standard multi-distribution method for cross-platform cloud instance initialisation. It is supported across all major public cloud providers, provisioning systems for private cloud infrastructure, and bare-metal installations.\n  During boot, cloud-init identifies the cloud it is running on and initialises the system accordingly. Cloud instances will automatically be provisioned during first boot with networking, storage, SSH keys, packages and various other system aspects already configured.\n  Cloud-init provides the necessary glue between launching a cloud instance and connecting to it so that it works as expected.\n In short, with cloud-init we can just add a script that runs on first boot and configures the node, leaving it in a state where we can simply SSH to it. So we\u0026rsquo;ll use cloud-init to configure:\n networking (IPs, gateways, hostnames etc). users / SSH access as described above (passwords, inserting SSH public keys etc). required packages, to get the node to a state where it\u0026rsquo;s ready for Kubernetes to be installed.   Preparation Before running the script, we\u0026rsquo;ll need to set our desired configuration. Copy config.ini.sample into config.ini and set any of the options below under the [config_generator] section:\n hostname_string=\u0026quot;rpi{num}-ubuntupi\u0026quot;: Change this to whatever you like. 
You need to include the {num} segment, as this will be used to interpolate the node\u0026rsquo;s number remote_admin_acc_ssh_key: This is the public key of the remote user, the one we\u0026rsquo;ll use to SSH to the nodes remote_admin_acc_username: The username of the remote user local_admin_acc_username, local_admin_acc_password: The username / password of the local user gateway: Your gateway\u0026rsquo;s address eth_network: The ethernet network (if you\u0026rsquo;re using ethernet) wifi_ssid, wifi_password, wifi_network: The WiFi network (If you\u0026rsquo;re using WiFi)   Example script execution.\n  Partition manager script From the repo\u0026rsquo;s documentation:\n partition_manager.py: Copies an image to an SD card or NVMe disk. Also, creates an additional partition to use as storage on the node, separate from the system partition\n So, this is step 1. We plug the disk, identify it (e.g. /dev/sda) and run the script to copy the image / create partitions.\nA few interesting options are:\n --force: Without it, you won\u0026rsquo;t be able to overwrite a disk with an existing partition table. With it, you\u0026rsquo;ll still get a prompt to confirm if a partition table exists. --debug: Prints each command\u0026rsquo;s output in nicely formatted text. 
Useful to understand exactly what\u0026rsquo;s going on --help: Shows a prompt explaining all arguments  The command will also prompt for things like partition size etc.\nAn example execution would be:\n# This will format /dev/sdX, flash the image \u0026amp; create an additional partition ./run.sh partition_manager /dev/sdX --image-path ./ubuntu-24.04-preinstalled-server-arm64+raspi.img --force --debug   Pay extra attention to the disk you use with the script!\nIf you choose the wrong disk you can mess up your OS pretty badly! Make sure you\u0026rsquo;ve passed the correct disk, the one you want to use with your Pi. I like to double / triple check things before running; that\u0026rsquo;s why the script will also issue a warning if a non-empty disk is passed.\nBut in any case, be very cautious!\n Config generator script Again, from the repo\u0026rsquo;s docs:\n config_generator.py: Generates cloud-init configuration, to help bootstrap the node on first execution. Also optionally copies this config in the boot partition.\n Without removing the disk, we\u0026rsquo;ll run this script as well to generate cloud-init configuration and optionally copy it over to the disk (recommended).\nApart from the options above, you can also use:\n --hosts-number: If you want to generate configuration for multiple hosts at once, set this to something greater than 1. If you do that, the script will generate directories for each node and will skip copying the files over to the NVMe disk. --offset 5: If you\u0026rsquo;re formatting each disk and generating config for each node together, this is useful. After the 1st node, set this to 1, 2, 3 etc to generate config for the 2nd, 3rd, 4th node respectively. 
This mode will copy the config over as well.  Example executions:\n# This will generate \u0026amp; copy cloud-init configuration. It'll generate static IP configuration for WiFi, but not for ethernet. ./run.sh config_generator /dev/sdX --no-setup-eth --setup-wifi --debug --hosts-number 1 --force # This will do the same, but will generate ethernet configuration and treat this as the 6th host, so it'll name it rpi06-ubuntupi.local. You can change the name pattern in the config file. ./run.sh config_generator /dev/sdX --no-setup-wifi --setup-eth --debug --hosts-number 1 --force --offset 5 Result Once you\u0026rsquo;ve done this, plug the NVMe disk into the Pi and let it run. It could take a few minutes; mine took around 5. Hopefully, if all goes well, you should be able to ping each node at {hostname}.local (or using its IP ofc) and SSH to it once cloud-init has finished.\n If something goes wrong, you should be able to use the local account username / password to log in to the Pi with a mouse / keyboard / screen. If that\u0026rsquo;s not working either, I\u0026rsquo;d make sure the Pi is booting from NVMe at all, by flashing an SD card and trying to boot from that instead.\n Outro Hopefully by now you\u0026rsquo;ll have your nodes ready to run Kubernetes!\nIn the next post we\u0026rsquo;ll use Ansible (kubespray) to install it and get our cluster going, so feel free to subscribe below to stay in touch!\n📣 Don\u0026rsquo;t be a stranger! 
If you used my scripts above to set up your cluster, I\u0026rsquo;d like to hear from you!\n❓Was it useful or not?\n❓Did you face any issues?\nHappy to hear your thoughts / feedback, either in the comments or on any of my social channels.\n","permalink":"https://iamsafts.com/posts/rpi_k8s/part2_bootstrap/","summary":"Preparing (bootstrapping) the Raspberry Pis to become Kubernetes nodes","title":"Kubernetes on Raspberry: Bootstrapping the Raspberry Pis"},{"content":" 🔗 This post is part of the \u0026ldquo;Kubernetes on Raspberry\u0026rdquo; series. To get the full context, it\u0026rsquo;s recommended to start from the beginning.\nHere is an overview of the entire series.\n Building a Home Lab with Raspberry Pis and k8s Released: June 2024\n    ️Hardware and Cluster Node configuration     Bootstrapping the nodes Released: September 2024\n    Installing k8s  TBA: October 2024\n    Monitoring  TBA: November 2024\n    Storage  TBA: EOY 2024\n    Ingress \u0026amp; exposing things to the world  TBA: 2025\n    Security  TBA: 2025\n    Running Databases  TBA: 2025\n    Running complex applications  TBA: 2025\n    BONUS: Enclosure \u0026amp; 3D Printing  TBA: Unknown\n Intro Hi there, 
welcome 👋!\nAfter the introductory post of the series, it\u0026rsquo;s now time we start exploring how to actually build the cluster.\nIn this post, we\u0026rsquo;ll talk about our cluster\u0026rsquo;s hardware:\n each node\u0026rsquo;s hardware configuration which / how many Raspberry Pis are needed additional components required (for storage, cooling etc)  I\u0026rsquo;ll also provide an estimated cost \u0026amp; links (wherever possible).\n Initially I also wanted to explain in this post how I 3D printed the rack-mount for the cluster and some additional components. However, even without that, this post turned out quite lengthy, so I decided to split that off into a future post.\n Let\u0026rsquo;s get started!\nHardware and node configuration Node hardware Let\u0026rsquo;s see what kind of hardware is required for this and why.\nRaspberry Pi configuration To make the most out of our cluster and support meaningful workloads, we\u0026rsquo;ll need to use Raspberry Pi 5s with the maximum amount of RAM. The 5\u0026rsquo;s CPU offers a significant performance boost over the Pi 4, and its 8GB RAM version gives us plenty of memory to work with, which is crucial. 
Given that Kubernetes core components are pretty memory hungry, anything less would be insufficient, especially for control plane nodes.
 apiserver pod memory usage is close to 2GB on the leader and ~700MB on every other control plane node.
 Average CPU utilization per node
 Average memory utilization per node
Storage
Traditionally the Pis used SD cards for storage. While this works, there are significant drawbacks, like the reliability of SD cards (they tend to fail after some power cycles) and their very poor performance. Since the Pi 4, we've been able to boot from USB SSD disks, which is a significant improvement, and a lot of Raspberry Pi Kubernetes clusters have been built using that as storage.
However, the Pi 5's PCI Express Gen 2.0 lane can unofficially achieve Gen 3.0 speeds. This gives us the opportunity to use NVMe storage, gaining a significant performance boost.
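For reference, opting into the unofficial Gen 3.0 speed is done through the Pi 5's firmware configuration. A minimal sketch (parameter names per Raspberry Pi's documentation; the file location may differ between images, and your image may already enable the connector):

```ini
# /boot/firmware/config.txt (typical location on recent Raspberry Pi OS / Ubuntu images)
# Enable the external PCIe connector and request the unofficial Gen 3.0 speed
dtparam=pciex1
dtparam=pciex1_gen=3
```

A reboot is required for the change to take effect, and since Gen 3.0 operation is unofficial, stability can vary between HATs and cables.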
Here's the theoretical performance of these storage options compared:

Metric              U3 V30 SD Card    NVMe SSD (USB 3.0)    NVMe SSD (PCIe 2.0 x1)    NVMe SSD (PCIe 3.0 x1)
Sequential Read     90-100 MB/s       400-500 MB/s          400-500 MB/s              900-1000 MB/s
Sequential Write    60-90 MB/s        400-500 MB/s          400-500 MB/s              900-1000 MB/s
Random Read IOPS    1,500-3,000       20,000-50,000         50,000-100,000            400,000-600,000
Random Write IOPS   1,000-2,500       20,000-50,000         50,000-100,000            300,000-500,000

So using an NVMe drive, especially at speeds that approach the PCIe Gen 3.0 bus speeds, should offer a significant benefit, both for the OS and the node's performance in general, but also for the performance of our storage-intensive workloads.
 About storage performance
Something that will become apparent below is that we'll use two options for storage:
- a high-performance one, which does not offer redundancy and doesn't allow exchanging PVs between different nodes
- a lower-performance, cloud-native & more reliable one, which allows flexible cloud workflows but is significantly slower (at least given our hardware limitations).
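Numbers like these are typically gathered with a tool such as fio. A minimal job-file sketch (the file path, sizes and queue depths are assumptions, not the exact settings used in this series; point `filename` at a scratch file on the disk under test, never at a device holding data you care about):

```ini
; hypothetical fio job: 4k random read, then 4k random write
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=64
size=1g
runtime=30
time_based=1
filename=/mnt/scratch/fio.test   ; assumption: scratch file on the disk under test

[randread]
rw=randread

[randwrite]
rw=randwrite
stonewall   ; run this job only after the previous one finishes
```

Running `fio jobfile.fio` prints IOPS, bandwidth and latency figures comparable to the ones quoted below.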
Now, while we won't be able to achieve the speeds above, our high-performance storage gets close enough and should be more than suitable for running our database workloads (which, for now, are the most storage-demanding workloads we'll need to run).
We'll talk more about this in the storage-related post, but here's a benchmark sneak peek:
IOPS (Read/Write)
- Random: 147,728 / 101,610
- Sequential: 34,090 / 109,379
Bandwidth in KiB/sec (Read/Write)
- Random: 838,629 / 623,584
- Sequential: 845,459 / 700,666
Latency in ns (Read/Write)
- Random: 100,895 / 35,608
- Sequential: 37,713 / 32,655
NVMe HATs
To use PCIe NVMe storage with the Pis, we'll rely on an NVMe HAT. HATs (Hardware Attached on Top) are a standard way of enhancing a Pi's hardware and are often used for network expansion, providing PoE to the Pis and a bunch of other things.
 To get a glimpse into what HATs can do, Jeff Geerling offers a very comprehensive list on his blog, with quite a few examples.
There are quite a few NVMe HATs for storage. Each one offers different capabilities, from mounting position (top / bottom) to the possibility of mounting different NVMe disk formats (2230, 2242, 2260, or 2280), or even two disks in some cases. Some even combine NVMe expansion with other expansion options, like AI chips or 2.5G networking.
While we could benefit from 2.5G networking, most of the combined options (the ones that provide more than a single NVMe disk) limit the bus to Gen 2.0 speeds, which is a trade-off I wasn't willing to make, especially for the high-performance storage nodes.
In my case, I chose the 52Pi N04 (Amazon.de, non-affiliate), for several reasons:
- it offers Gen 3.0 speeds.
- it supports the 2280 SSD format, which is widely used and makes it easier to find affordable NVMe drives.
- it is mounted on top, which makes it more compatible with my vertical rack holders.
The Gen 3.0 speed requirement is only relevant for the high-performance storage nodes. The rest (running Longhorn) are bottlenecked by various other factors. For these, I'm considering switching to something like the Geekworm X1004, since:
- having an extra disk might improve Longhorn performance. I'll be able to use a dedicated disk for the OS and another, dedicated disk for Longhorn, which supposedly makes it faster (not validated yet)
- having a separate disk for the OS makes things easier to manage
- the X1004 replicates the GPIO, which the 52Pi N04 doesn't do and might be limiting in the future
Keep in mind that these HAT options made sense for me but might not make sense for your build. In that case, the list above should be a fairly good starting point to discover what suits your individual needs best.
 N04 node
 X1004 node
NVMe drives
Some NVMe HATs (like the Pimoroni HAT) offer a list of pre-tested & approved NVMe drive models.
Neither of the ones I used above does, but I've been using this KIOXIA Exceria 500GB drive without any issues. Performance is fine, and it's quite budget-friendly too.
Thermals
To begin with, we'll use the Pi's official cooler to cool our nodes. While this is not enough if we have a lot of nodes stacked together, it lets us operate within acceptable temperatures. Using just that, in a vertically mounted rack configuration and a pretty low-usage scenario, my Pis average around 50°C.
While not bad, at some point I decided to experiment with some case fans and saw a significant temperature drop of around 15 degrees. So this is something worth considering, especially if you want to keep your cluster utilization high.
 Temperature comparison without / with case fans
 Case fan mounting option
Networking
I'm assuming that most people playing around with this won't have a very advanced home network setup. I most certainly didn't, and faced a number of issues along the way because of it, so we'll explore them here as we go along, making little to no assumptions about your network setup and its capabilities.
To avoid discouraging anyone, I'll share that when I first started building the cluster, I was only using my ISP's modem / router for networking. I was using WiFi (the Pis have onboard WiFi controllers) and everything pretty much worked.
I did transition to a Mikrotik CCR-2004-16G-2S+PC somewhere along the line, and while it did make a difference in manageability, it was in no way required.
Here's roughly what you need to keep in mind:
- it's really important for the Kubernetes nodes to have static IPs. Dynamic IPs can cause issues, so we need to ensure each node is configured with a static IP that's outside your DHCP server's range.
- it's best to avoid using WiFi. Ethernet is much more stable and definitely more performant.
- if you opt for Ethernet, you'll also need a switch (an 8-port one, or even smaller depending on your node count). You can get a pretty cheap one; we don't need any special features, so any unmanaged switch will do.
Cluster node configuration / constraints
When building a Kubernetes cluster, there aren't really any very specific hardware restrictions. Apart from some very basic CPU / memory requirements, one can build a cluster on top of beefy servers, on a single machine (by creating VMs with an OS like Proxmox), or even inside a Docker container.
However, to make the most out of this experiment, I decided to add some constraints.
- I want to have more than one node
  - this gives me the opportunity to experiment with high availability and see how the apps I deploy behave when e.g. a node suddenly becomes unavailable.
- I would like my cluster to have persistent storage
  - most applications require some sort of persistence. I wouldn't like to build a cluster that can't support a single database workload.
  - my storage needs to be as reliable and as performant as possible (I'll explain more on this in a later post).
- Ideally, I'd like to be able to remove some nodes without the cluster losing basic functionality
  - e.g.
given that I don't have another Linux machine, being able to remove one node, swap the disk (or add an SD card) and have a Linux machine available for experimentation is pretty useful.
  - if a Pi (or a disk) fails, I shouldn't lose any functionality in the cluster (or any data).
- I want to run meaningful workloads on the cluster
  - this means we'll need the most powerful CPUs available on the Pis, but also the biggest available memory configuration (meaning, we'll need RPi 5s, the 8GB versions).
Now, a big factor that shapes the node configuration is the set of constraints imposed by storage. After looking into my options for storage, the two most prominent choices (both provided by Rancher) were:
- Longhorn
  - Highly available, cloud-native storage. Pods can mount PVs that are not necessarily on the same node.
  - Optimal for workloads that can move around nodes, or workloads with replicas on different nodes that need to mount the same PV.
  - Sadly, in our setup it was pretty slow (we'll talk more about why in the storage post).
  - Longhorn needs at least 3 physical nodes to guarantee data availability.
- Local-path provisioner
  - Local storage, meaning the PV needs to be on the same node as the pod(s) using it.
  - Very fast (especially when using a fast disk).
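To make the trade-off concrete, here's a sketch of how a workload would pick between the two via its PersistentVolumeClaim. The `longhorn` and `local-path` StorageClass names are the defaults each project installs; the claim names and sizes are placeholders:

```yaml
# PVC backed by Longhorn: replicated, the volume can follow the pod across nodes
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data-replicated
spec:
  storageClassName: longhorn      # default class installed by Longhorn
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
# PVC backed by local-path: fast, but pins the pod to the node holding the data
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data-local
spec:
  storageClassName: local-path    # default class of Rancher's local-path-provisioner
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```

The only difference from the application's point of view is the `storageClassName`, which is what makes it cheap to experiment with both.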
Highly available, fast storage version (HA)
The Kubernetes control plane & Longhorn both need at least 3 nodes to provide high availability.
So, to build a highly available (HA) cluster, we end up with:
- 3 control plane nodes (not necessarily dedicated)
- 4 Longhorn nodes (3 as a minimum + 1 for redundancy)
- 2 local-path nodes (1 as a minimum + 1 for redundancy)
If we don't need the control plane nodes to be dedicated and can afford to reuse them as storage nodes as well (I don't see a reason not to), then we can achieve all of the above with a 6-node cluster.
 Our HA cluster
 One design choice I made was putting Longhorn & local-path storage on separate nodes. There's no strict requirement to do that; you can have both on the same node and use different paths / partitions. However, you'd need larger disks for that and it might get more complex to manage.
Other variations
While this version offers high availability, it obviously comes with a larger hardware cost. Let's also explore a few lighter variations:
- if we don't need to casually remove nodes from the cluster, we can remove 2 nodes, which leaves us with 4 nodes.
- if we also don't need local-path storage, we can remove another node.
  - or you could even put local-path and Longhorn storage on the same node.
- if we don't care about having a highly available control plane, we can run it on just 1 node, although if we care about fault-tolerant Longhorn storage we need 3 nodes anyway.
- if, finally, we don't care about fault-tolerant Longhorn or control plane high availability, we can have just a 2-node cluster. But this is not really recommended, since a disk failure would most likely cause data loss.
Variation #2 offers high availability (and highly performant storage if you install both Longhorn and local-path) but limits the capacity of our cluster for meaningful workloads, since all nodes run the control plane as well as worker pods.
Variation #3 gives the cluster some breathing room, trading availability for worker capacity.
Variation #4 is the most budget-friendly option we can have while still being able to experiment with storage workloads.
 Variation 1 doesn't sacrifice anything apart from the ability to remove nodes.
 Variation 2 sacrifices the capacity of the worker nodes, since there are no dedicated workers.
 Variation 3 trades high availability for increased worker capacity.
 Variation 4 is pretty budget-friendly, although quite limited.
Cost breakdown
Keeping in mind all of the above, let's summarize the costs we've identified so far and see how much each node will cost.
I'll use amazon.de links to have a common point of reference, but in my case local vendors were much cheaper for quite a few of these parts.

Component                     Cost
Raspberry Pi 5 8GB            95€
Official PSU                  15€
Official Active Cooler         8€
N04 NVMe HAT                  14€
KIOXIA Exceria NVMe 500 GB    40€
Total                        172€

These prices include VAT, and components purchased locally were at least 10% cheaper. But this is a high-level estimate of how much a single node should cost.
That puts our 6-node cluster close to 1.000€ total, or a more humble 4-node cluster close to 700€. While that might seem high, keep in mind that a 2-node cluster, an extremely limited managed database and a load balancer on the cheapest provider out there (DigitalOcean) cost around 700€/year.
So if you're in this for the long run, it's definitely worth it.
Concluding
Hopefully by now you know:
- what components to purchase
- how to architect your cluster (and why)
- an overview of the cost required for this effort
In the next post, I'll explain how to bootstrap the nodes, install the required utilities & configure them in preparation for installing Kubernetes. Don't forget to subscribe below to get updates about the next posts.
Also, if you're building this, I'd love to hear your thoughts and any questions you might have. Feel free to drop a comment below or reach out to me on my socials.
See you soon 👋!

Permalink: https://iamsafts.com/posts/rpi_k8s/part1_hardware/
Summary: Explaining the hardware used for the nodes, the cluster's node configuration and costs.
Title: Kubernetes on Raspberry: Hardware and Cluster Node configuration

🔗 This post is part of the "Kubernetes on Raspberry" series.
To get the full context, it's recommended to start from the beginning.
Here is an overview of the entire series.
- Building a Home Lab with Raspberry Pis and k8s
- Hardware and Cluster Node configuration (Released: August 2024)
- Bootstrapping the nodes (Released: September 2024)
- Installing k8s (TBA: October 2024)
- Monitoring (TBA: November 2024)
- Storage (TBA: EOY 2024)
- Ingress & exposing things to the world (TBA: 2025)
- Security (TBA: 2025)
- Running Databases (TBA: 2025)
- Running complex applications (TBA: 2025)
- BONUS: Enclosure & 3D Printing (TBA: Unknown)
 This is the first in a series of posts
While building my home lab, I came across a lot of posts from people who had built similar things.
However, I always felt that I was missing either the context (why do this thing in this particular way) or the details, implementation or otherwise. By splitting this into different posts, I hope to provide the full context of each decision I make, along with a comprehensive & complete guide so that you can do the same.
This first post covers the reasoning behind it all and why I'm doing it this particular way, so it's a bit high-level and introductory. The next ones will be more technical and closer to actual tutorials.
I will be updating this post with links to the other related posts as they go live.
 Illustrations were created by Copilot Designer (DALL·E 3). Expect AI randomness!
Introduction
Ever since I started working in tech, I've been fascinated by automation and standardization. I've always admired efficient CI / CD pipelines, scalable & configurable infrastructure that can be provisioned "at the touch of a button", and having repeatable setups to host your services on.
When I started out, it wasn't uncommon for people to deploy their services on bare-metal servers that were provisioned once and then lived forever. These servers were difficult to maintain, monitor, upgrade etc., and a number of solutions were developed by system administrators to work with them. It all felt very cumbersome, though, at least from my (undoubtedly limited, I'll give you that) developer's point of view.
I always felt that there would eventually be a better way.
Over the years, many people would manage / deploy their apps with technologies like Ansible, or develop sets of scripts to provision devices & servers for specific use cases. Some would use (or build) images that came with their apps' dependencies preinstalled and host their apps off of that. These solutions all worked well, but then came Docker, and shortly after, one of the best ways to run containers in production: Kubernetes.
This felt like a big change. Development was streamlined; you could spin up a bunch of (micro)services locally and not break your development machine. You could let them interact (almost) as they would in production with little to no overhead. Productivity skyrocketed, and the days of "breaking your local env" because you upgraded a package or missed a line of config were pretty much gone (again, not 100%, but definitely significantly reduced).
 Docker solved packaging your app.
But then what about production? How would your local env compare to what was running live? There were a bunch of different "cloud app" platforms, most of which worked really well. From Heroku, one of the first (that git push deploy was amazing), to more docker-aware solutions (like AWS ECS). But usually they either made a lot of abstractions / assumptions that would constrain your application's architecture, or they were too expensive to run at a higher scale. After a while, a lot of organizations turned to the more "hands-on" solution: Kubernetes (k8s for short).
From the official website:
 Kubernetes is a portable, extensible, open source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation.
In short, Kubernetes provides a way to take your containerized application and run it in the cloud.
It\u0026rsquo;s really flexible, highly configurable, and is built from the ground up with scalability in mind. It provides ways to deploy your app, load balance it, and scale it horizontally out of the box. It lets you describe the most complex workflows and deploy them gracefully. It lets you have multiple teams working on it and isolate everyone\u0026rsquo;s permissions, both at a user and at an application level. And that doesn\u0026rsquo;t even scratch the surface of what it can do.\nBut, going back to my love of automation and standardization, the things I probably love most about it are:\n that every bit of it is configurable by code (ehm, by YAML to be fair) it\u0026rsquo;s high level, meaning you don\u0026rsquo;t have to worry about the internals of the hosts your app will run on it\u0026rsquo;s declarative, meaning you describe the state you want to be in, and it somehow achieves that state, usually quite gracefully (or, if not by default, by offering you a way to configure it to do so).  However, as you would probably expect, k8s does come with a steep learning curve and it does require a fair amount of maintenance. And while it might be a bit much for smaller teams/products, I\u0026rsquo;ve found that most of the larger companies I\u0026rsquo;ve worked for are pretty happy with it and would not consider using another technology (especially since it\u0026rsquo;s possible to use managed k8s clusters to avoid most of the administration).\nSo still, in my book,\n Kubernetes solved running your app.\n Why I\u0026rsquo;m doing this A few things have become increasingly clear to me over the past years and led me to start this project.\nTo gain exposure / familiarity Let\u0026rsquo;s start here: I love these two tools!\nDocker is something I work with on a daily basis and I\u0026rsquo;ve been exposed to it quite a bit. 
I\u0026rsquo;ve built some moderately advanced setups with docker/docker-compose and I feel I\u0026rsquo;m more or less familiar with its fundamentals, best practices and can use it pretty efficiently.\nKubernetes however, not so much. While I do work with it, most apps come with a set of pre-built Helm charts from a much more specialized DevOps / DX / SRE team and I just get to fiddle around with configuring them. While that covers the vast majority of what I\u0026rsquo;m asked to do in most situations, I\u0026rsquo;m not comfortable with having such limited, high-level knowledge of the tool used to deploy the apps I\u0026rsquo;m building.\nHence, I need more meaningful Kubernetes exposure.\n AI rendition of Docker and \u0026ldquo;kninetes\u0026rdquo; working together\n  To experiment with more complex architectures My first challenge as a software engineer was learning to code adequately. Then, building small apps. After that, designing and building small / medium sized services.\nCurrently I\u0026rsquo;m exploring how to efficiently design, deploy and scale different services/systems that interact with each other. So, I need an infrastructure to support that and I need it to be an industry standard, something I\u0026rsquo;ll find out there. Something that can support multiple apps, natively model their interactions, their scaling and their dependencies.\nSo I need an infra to enable this sort of experimentation.\n A \u0026ldquo;piece code\u0026rdquo; growing into a \u0026ldquo;intercarting of ther\u0026rdquo;. Can\u0026rsquo;t make this stuff up!\n  To become more of a T-shaped engineer I\u0026rsquo;ve spent most of my professional life building Web Applications. More specifically, their backend. Even more specifically, I\u0026rsquo;ve been building them using Python.\nAfter almost 10 years of working on something so niche, I feel like I\u0026rsquo;m kind of stagnating. 
And while there\u0026rsquo;s always more to learn in the technologies/stack I\u0026rsquo;ve been using, lately it feels like the return on specializing further is disproportionate to the investment required to do so.\nOn top of that, I always envisioned myself as a T-shaped engineer. Since I feel like I\u0026rsquo;ve done a decent amount of deep digging, it\u0026rsquo;s time to expand my knowledge horizontally. Now, one would think I\u0026rsquo;d need to explore more frontend technologies, but since I\u0026rsquo;ve always felt closer to the infra side of things, this feels like an easier step to take.\nOK, maybe I\u0026rsquo;ll become an L-shaped engineer first 😅, but still I need more infrastructure exposure.\n An L-shaped Software Engineer. That\u0026rsquo;s not too bad!\n  It makes sense to DIY this Now, a reasonable question would be why not do this with a managed k8s cluster. Most cloud providers offer them, and it would save a reasonable amount of time and effort. And that\u0026rsquo;s in fact what I did over the past year. I built a managed cluster, and started adding components to it. However, I ran into cost issues.\nA very small cluster on DigitalOcean (probably one of the cheapest decent providers out there) with a bare minimum of 2 nodes costs around 36$/month. Add a load balancer for 12$/month, some storage for a few bucks and a managed database for 15$/month, and you easily get close to 70$/month. That, combined with the fact that I don\u0026rsquo;t have a dedicated amount of time to spend on the project, but rather spend whatever free time I can find here and there on it, increased the cost of having something permanently deployed on a cloud provider significantly. After 1 year of paying for it, I added up the costs and the total was close to what I\u0026rsquo;d have to spend to build a home lab myself. 
Plus, building it myself would contribute much more to my goal of improving my infra skills.\nThis is obviously the dilemma we all face in different aspects of our lives: Buy it or DIY it? But in this case, taking the time to build it myself contributed to a greater goal as well. So it became clear that DIY was the way to go.\n A DIY k8s cluster. Don\u0026rsquo;t ask.\n  Because it\u0026rsquo;s fun Last but not least, I love building things! My favourite toys growing up were LEGOs. I loved the complexity! Some of my happiest childhood memories are of getting shipments from my grandparents, who lived abroad, with LEGOs that were not available in my country. I would spend hours or even days building them, only to never play with them again afterwards because, duh, building them was the most fun part! I\u0026rsquo;ve been longing for that feeling ever since, trying to find my \u0026ldquo;grown up LEGO set\u0026rdquo;.\n25 years later, and being a software engineer, this feels like the closest thing to it.\nHuge bonus: I managed to sneak some 3D design/printing into the process, which made it about a million times more fun.\n Kids playing with LEGO, adults with software.\n  So, what are we building?  The cluster blinking its lights! 
Ain\u0026rsquo;t it pretty?\n  Right, let\u0026rsquo;s get to the point!\nHere\u0026rsquo;s what I\u0026rsquo;m building:\n a k8s cluster on top of 4 (or more) Raspberry Pi 5 nodes, with 8GB of RAM and NVMe storage (via a HAT adapter)  Ubuntu Server for ARM will be the OS, as it comes with cloud-init and a lot of dependent kernel modules for things like storage etc   I\u0026rsquo;ll manage this with either IaC solutions (most likely Pulumi or maybe Terraform) or tools like kubespray (Ansible)  various components / apps will be deployed with Helm   I\u0026rsquo;m aiming to share scripts or parts of my solutions (probably not the whole thing for security reasons)  I\u0026rsquo;ll share how I bootstrap the nodes, possibly even some cloud-init configuration to help with that   Grafana Cloud will be our monitoring tool of choice The cluster will include  load balancer(s) (probably MetalLB), to expose services outside the cluster in-cluster certificate signers for HTTPS maybe a block storage provider in-cluster (like Longhorn or rook-ceph, still TBD) cloud-native database(s) (using solutions like Crunchy-Postgres) possibly multiple namespaces, for different apps I\u0026rsquo;ll be deploying a self-hosted container registry, where application containers will be pushed \u0026amp; pulled from   I\u0026rsquo;ll also explore security and possibly deploy solutions to enforce it within pods  This is an active project, so things may change. But since it\u0026rsquo;s been somewhat built already, I am fairly confident about most of the above. 
Expect changes though, as I build more components and understand the topic more deeply.\nSome bonus topics I hope to cover:\n Building the cluster\u0026rsquo;s rack mount using 3D printing (and designing around our limitations) The cost of the entire thing, both building \u0026amp; running, with links to components to help you get started (no affiliate links) Network setup, if I manage to buy a proper router (I\u0026rsquo;ve been eyeing this Mikrotik CCR2004-16G-2S+PC) and build a proper network for my homelab Administrative operations, like e.g. adding a new node, shutting down the cluster, backing up a database etc  Lots of stuff, so let\u0026rsquo;s see where it takes us!\nFormat I\u0026rsquo;ll be sharing separate posts for each thing I\u0026rsquo;m working on. The format will be something between a journal and a tutorial. This may cause things to get repeated/changed as we go along. I\u0026rsquo;ll do my best to update older posts if newer ones deprecate them, but some things will most likely slip through. I hope to open-source scripts \u0026amp; configuration snippets along the way, but as I said, most likely not everything will be open sourced. I\u0026rsquo;ll be updating this page with links to individual posts as I go. The timeline I\u0026rsquo;m aiming for is 1 post per month, as my time is limited and writing them takes a long time!\nConcluding This has been a long, rather abstract post, so thank you for reading this far! While you should probably expect the next ones to be mainly technical, I felt it was important to explain the need for this. Telling people that I planned to run a k8s cluster myself always led to strong \u0026ldquo;Why?\u0026rdquo; questions. I believe that explaining my reasons might inspire others to follow this path. 
I\u0026rsquo;ll also share the problems I encountered along the way and the effort required to reach each milestone.\nIf you\u0026rsquo;re interested in following the process, subscribe to my mailing list to make sure you don\u0026rsquo;t miss a post. If you\u0026rsquo;ve already done so, thank you! I\u0026rsquo;ve never actually sent out anything, but I hope to start sending out a monthly digest from now on. So now\u0026rsquo;s a good time to consider it!\nSee you soon (hopefully) 👋 !\n","permalink":"https://iamsafts.com/posts/homelab-intro/","summary":"Why I\u0026rsquo;m building a Home Lab using Raspberry Pis, Kubernetes and 3D printing.","title":"🧪 Building a Home Lab with Raspberry Pis and k8s"},{"content":"In web applications it\u0026rsquo;s not rare to face performance issues that we can\u0026rsquo;t quite understand. Especially when working with databases, we treat them as this huge \u0026ldquo;black box\u0026rdquo; that 99% of the time works amazingly without us even caring about it. Heck, we even use stuff like ORMs that essentially \u0026ldquo;hide\u0026rdquo; our interaction with the database, making us think that we don\u0026rsquo;t need to care about this stuff.\nIf you\u0026rsquo;re developing something small, contained, simple then this is probably the case. Your database will most likely perform OK no matter how poorly designed or configured it might be. You won\u0026rsquo;t have any issues using \u0026ldquo;naive\u0026rdquo; queries built by ORMs, everything will work just fine.\nHowever, as your application grows, the database is something that can make or break you. It\u0026rsquo;s one of the hardest things to scale (you can\u0026rsquo;t just spin up multiple instances), it\u0026rsquo;s hard to re-design or migrate and essentially it\u0026rsquo;s the core of your application, serving and storing all your data. 
Hence its performance is critical.\nIn this post I\u0026rsquo;ll showcase a real-life example of debugging a seemingly weird database performance degradation. While I obviously intend to share the solution and what to avoid, I\u0026rsquo;d also like to take you through the journey and show you some tools \u0026amp; processes that can help you dig into SQL performance.\nLet\u0026rsquo;s go!\nThe system \u0026amp; the problem The database we\u0026rsquo;ll be studying is an AWS Aurora RDS (PostgreSQL 12). It is a clustered database and has two instances, a reader (read-only replica) and a writer. AWS Aurora is pretty close to an actual PostgreSQL with some zero-lag replication capabilities on top (and some managed features of course). The whole process discussed here should apply to a self-managed RDS PostgreSQL as well.\nThe problem we will be studying is the (seemingly) random poor performance of UPDATE / INSERT statements. This was observed in a specific table that had ~20,000,000 rows and 23 indexes.\nSo while most writes (\u0026gt;99.99%) take \u0026lt;10ms to complete, some statements were taking more than 40 seconds. Some even ended up being killed by the statement_timeout setting (which was set at 100s!). It was baffling to say the least.\nAssumptions Since this wasn\u0026rsquo;t consistently reproducible, several assumptions were made:\nToo much write volume The first assumption was that there was just too much write volume on the database. While this is partially true, our evidence didn\u0026rsquo;t support that this could be the root cause of the problem, since it occurred uniformly both in periods with high write volume and in periods where write volume was minimal.\n Slow writes occurred both in idle times (06:00 - 08:00) and in high load intervals (13:00 - 15:00) with similar frequency\n  Batch writes Quite often the application was writing rows in batches. This was also considered a possible cause and batched writes were removed. 
However, the problem remained (and the performance overall was worse when writing one row at a time).\nToo many indexes This table had numerous access patterns, and hence it required numerous indexes to perform well. It had 23 B-Tree indexes, 6 Foreign-key constraints, 3 BRIN indexes and 1 GIN index (for full text search). While it is clear that indexes play a role in write performance (since every write needs to update every index), this didn\u0026rsquo;t explain why most updates were really fast and some excruciatingly slow.\nLocks The last assumption was that there were competing locks in the database. Specifically, maybe some long running processes opened big transactions and locked resources for a long time. Then, other writes were waiting to update the locked rows and couldn\u0026rsquo;t finish. This seemed like a good assumption and it couldn\u0026rsquo;t be disproved with the data at hand. So it was time for further investigation.\nValidating To help us check our assumptions, PostgreSQL offers some tools. Those can be enabled via its configuration (postgresql.conf or the RDS parameter groups in AWS). Some interesting options are:\n log_lock_waits: Enabling this will instruct the Deadlock Detector to log whenever a statement exceeds deadlock_timeout. There is no performance overhead enabling this, since the Deadlock Detector should be running anyway, and if it is, then it\u0026rsquo;s practically free (source) log_min_duration_statement: This will log queries running for more than X ms auto_explain: This is actually a number of configurations (auto_explain.log_min_duration, auto_explain.log_analyze etc). They control when and how PostgreSQL will automatically perform an EXPLAIN on running queries. Those are useful as a precaution too, to make sure that poorly performing statements will leave traces \u0026amp; query plans for you to debug. You can read more here log_statement: This is pretty useful. It controls which statements get logged (its valid values are none / ddl / mod / all). 
If you want to find out if something\u0026rsquo;s wrong with your database, it\u0026rsquo;s a common practice to set this to all for some time, gather the output and analyze it with a tool like pgBadger. You can see all the logging-related options here  So, I went to enable those, reproduce the issue and see what the heck is going on.\n What is this EXPLAIN you keep mentioning?\nIn short, EXPLAIN will show us the execution plan the PostgreSQL query planner chooses when running a query. It will show which strategies are used, when it JOINs, which indexes are used and some metrics about costs etc. If you are not familiar with the EXPLAIN statement, I will be using it a bit below so you can take a quick read at this to get a basic understanding.\n Reproducing the issue However, while testing something irrelevant on our staging instance, I managed to reproduce the issue predictably. To do that, all I had to do was to update 5000 rows on this table. 
While doing that, once or twice every 5000 updates, an update would take \u0026gt; 40sec!\nThis was a true blessing, because it ruled out both the \u0026ldquo;too much write volume\u0026rdquo; hypothesis (our staging DB had zero traffic) and the long locks as well, since there were no processes locking the rows I was updating.\nI performed an EXPLAIN ANALYZE on the problematic query to see what was going on.\nThe query:\nEXPLAIN (ANALYZE VERBOSE BUFFERS COSTS) UPDATE \u0026#34;my_table\u0026#34; SET \u0026#34;match_request_id\u0026#34; = \u0026#39;c607789f-4816-4a38-844b-173fa7bf64ed\u0026#39;::uuid WHERE \u0026#34;my_table\u0026#34;.\u0026#34;id\u0026#34; = 130561719; The query plan for the fast execution:\nUpdate on public.my_table (cost=0.43..8.45 rows=1 width=832) (actual time=2.037..2.037 rows=0 loops=1) Buffers: shared hit=152 read=1 I/O Timings: read=1.22 -\u0026gt; Index Scan using my_table_pkey on public.my_table (cost=0.43..8.45 rows=1 width=837) (actual time=0.024..0.026 rows=1 loops=1) Output: (...) Index Cond: (my_table.id = 130561719) Buffers: shared hit=4 Planning Time: 1.170 ms Execution Time: 2.133 ms and the one for the extremely slow runs:\nUpdate on public.my_table (cost=0.56..8.58 rows=1 width=832) (actual time=34106.965..34106.966 rows=0 loops=1) Buffers: shared hit=431280 read=27724 \u0026lt;----- THIS IS HUGE!! I/O Timings: read=32469.021 -\u0026gt; Index Scan using my_table_pkey on public.my_table (cost=0.56..8.58 rows=1 width=832) (actual time=0.100..0.105 rows=1 loops=1) Output: (...) Index Cond: (my_table.id = 130561719) Buffers: shared hit=7 Planning Time: 23.872 ms Execution Time: 34107.047 ms We could easily note the following:\n the predicted cost was the same in both cases (although running time clearly wasn\u0026rsquo;t) the 2nd case ended up reading ~460k buffers!  The last one was a clear indicator that there was an issue. But I couldn\u0026rsquo;t figure out what was causing this. 
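As a side note, when a single write touches that many buffers, it helps to see what each write actually has to maintain. Here is a quick sketch, using standard PostgreSQL statistics views, to list a table's indexes with their on-disk sizes (my_table matches the examples above; column aliases are my own):

```sql
-- List every index on the table together with its on-disk size,
-- largest first. pg_stat_user_indexes is a built-in statistics view.
SELECT indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan AS scans
FROM pg_stat_user_indexes
WHERE relname = 'my_table'
ORDER BY pg_relation_size(indexrelid) DESC;
```

This doesn't point to a culprit on its own, but with 20+ indexes on one table it narrows down which ones are worth a closer look.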
I started doing various experiments, hoping to mitigate it. I tried:\n Doing a VACUUM FULL (ref) hoping that maybe this happened because AUTOVACUUM didn\u0026rsquo;t function well. Sadly, no result. Doing an ANALYZE on the table to force PostgreSQL to update its stats and maybe execute the query more efficiently. Again, no luck. Dropping and recreating the involved index. This didn\u0026rsquo;t work either.  At this point I was pretty much out of ideas.\nGetting answers Before giving up, I decided to reach out to the masters. I wrote the following post on the DBA StackExchange, which is a targeted community for Database Administrators and developers working with databases.\n UPDATE on big table in PostgreSQL randomly takes too long     What amazed me was that even before I explained the specifics of my case, people were asking in the comments if my table had a GIN index. Actually, they were pretty sure that it had one. Moreover, they suggested that I had something called fastupdate enabled.\nThat led me to the documentation (once again). Let\u0026rsquo;s quote a bit from it:\n GIN Fast Update Technique\nUpdating a GIN index tends to be slow because of the intrinsic nature of inverted indexes: inserting or updating one heap row can cause many inserts into the index (one for each key extracted from the indexed item). As of PostgreSQL 8.4, GIN is capable of postponing much of this work by inserting new tuples into a temporary, unsorted list of pending entries.\n This described our case 100%. 
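If you want to see the pending list with your own eyes, the pgstattuple contrib extension ships a pgstatginindex() function. A sketch, assuming the extension can be installed (the index name below is made up):

```sql
-- pgstatginindex() comes with the pgstattuple extension and reports
-- the state of a GIN index's pending list. The index name is made up.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Returns version, pending_pages and pending_tuples. A large pending
-- list means the statement unlucky enough to trigger the next cleanup
-- will pay for flushing all of it.
SELECT * FROM pgstatginindex('my_table_fts_gin_idx');
```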
Most writes, since they didn\u0026rsquo;t trigger a cleanup of the pending list, were blazing fast. However, when the pending list cleanup was triggered (its size grew beyond gin_pending_list_limit) the process that performed the write blocked until the pending list was cleaned up and the index was synchronized.\n The action that triggered a cleanup blocked until the contents of the list were persisted to the index. Note that SELECTs also have to go through the pending list contents, which are unsorted, so there may be a performance overhead there as well.\n  Experimentation I went on to check if my index had fastupdate set. This is an option in the index storage parameters. To check that, you can use \\d+ \u0026lt;index_name\u0026gt; in psql. I didn\u0026rsquo;t see anything there, but reading up on the CREATE INDEX command I noticed that fastupdate was ON by default. I switched it off to do some tests:\nALTER INDEX \u0026lt;index_name\u0026gt; SET (fastupdate=off);   Changing index storage parameters\nBe careful when running statements like the one above. For one, this will take a lock until the index storage parameters are changed. Moreover, disabling fastupdate means that you will have to manually clean up the pending list too (using SELECT gin_clean_pending_list(\u0026lt;index_name\u0026gt;)) or rebuild the index (using REINDEX). Both cases will probably cause performance or integrity issues in a production system, so be careful.\n Voila! The problem was gone. Every write took the same, predictable time. However, as expected, it was noticeably slower. 
So I was reluctant to consider disabling fastupdate altogether.\nPossible solutions At this point, a complete solution had been submitted in my StackExchange post and I saw some other more viable options:\n I could run VACUUM more aggressively, hoping that it would clean up the pending list in the background and my queries would never trigger a cleanup. However, I don\u0026rsquo;t think this would be 100% reliable either. I could set an even higher gin_pending_list_limit (default: 4MB). This would mean that cleanups would be really rare, but it could impact SELECT statements (they have to read the pending list too) and if a cleanup occurred it would take huge amounts of time. I could set a background process to perform a SELECT gin_clean_pending_list() periodically. However, much like option 1 this would not guarantee anything. I could set a smaller gin_pending_list_limit so that cleanups are more frequent but take less time.  I decided to go with the last option, and ran some experiments to see how this would impact the system. Out of curiosity, I even dropped the index to see how much it affected write performance. You can see some results below:\n |  | no GIN index | GIN index (without fastupdate) | GIN index (with fastupdate \u0026amp; threshold 4MB) | GIN index (with fastupdate \u0026amp; threshold 128KB) | | 1 update: best case | \u0026lt; 1 ms | 50 ms | \u0026lt; 1 ms | \u0026lt; 1 ms | | 1 update: worst (observed) case |  |  | 3m 10s (with gin_pending_list flush) | 4 s (with gin_pending_list flush) | | 5000 updates (non-batched) | 2min | 6m 30s | 7m | 7m | | 5000 updates (batched, 20) | 1min | 7m | 7m | 7m |  Some interesting insights:\n Average time of inserting 5000 rows is the same without fastupdate and with any size of gin_pending_list_limit, which is expected. Updates that don\u0026rsquo;t trigger a cleanup take the same time, no matter how big or small gin_pending_list_limit is (again, expected). 
With a value of 128KB, updates that triggered a cleanup took 4 s, which was very tolerable. When the index was dropped, we saw a huge performance boost (3x faster with non-batched updates and \u0026gt;6x faster with batched!)  Solving the issue By experimentation, 128KB seemed like a good value. So I chose to proceed this way.\nNow, there were various ways to set the gin_pending_list_limit:\n Via postgresql.conf (or DB parameter groups in AWS RDS). This affects all GIN indexes. In AWS RDS it doesn\u0026rsquo;t require a restart (it\u0026rsquo;s a dynamic parameter). If you\u0026rsquo;re running a self-managed PostgreSQL, a configuration reload is enough for the changes in postgresql.conf to take effect By altering the index storage parameters (ALTER INDEX \u0026lt;index_name\u0026gt; SET (gin_pending_list_limit=128)). But this could cause a number of issues (see the note above) By altering the gin_pending_list_limit for the specific user (ALTER USER \u0026lt;user_name\u0026gt; SET gin_pending_list_limit=128). This would affect all new connections and wouldn\u0026rsquo;t require a restart.  Personally I\u0026rsquo;d choose the first one. In this case I had to go with the latter because of some unrelated issues. But they all would do the trick.\nThe next day After monitoring for 1 week, there were no random failing writes, which was an amazing relief since the issue had been there forever. The whole process took about a week and apart from gaining knowledge on GIN index internals, it also provided some insight into how much a GIN index can affect write times and triggered a discussion about reconsidering full text search in PostgreSQL.\n  Enjoyed this article? Feel free to share it! You can also subscribe below.   Have suggestions? I'd love to hear from you! Don't hesitate to reach out in any of my social channels.  
","permalink":"https://iamsafts.com/posts/postgres-gin-performance/","summary":"In web applications it\u0026rsquo;s not rare to face performance issues that we can\u0026rsquo;t quite understand. Especially when working with databases, we treat them as this huge \u0026ldquo;black box\u0026rdquo; that 99% of the times works amazingly without us even caring about it. Heck, we even use stuff like ORMs that essentially \u0026ldquo;hide\u0026rdquo; our interaction with the database, making us think that we don\u0026rsquo;t need to care about this stuff.\nIf you\u0026rsquo;re developing something small, contained, simple then this is probably the case.","title":"Debugging random slow writes in PostgreSQL"}]