• WSL2, DNS tunneling, and .local names

    Just copying the text of a post I made to Reddit. Long story short: if you want .local domain names to work in WSL2, you may have to turn off DNS tunneling.

    Query

    I’ve got several Linux boxes on my LAN running avahi-daemon. I can connect to machine.local fine on most of my LAN, including my Windows desktop in a web browser or DOS shell. However, my WSL2 system cannot connect to machine.local, it seems unable to resolve mDNS at all.

    I tried finding an answer and found several years’ worth of advice, some conflicting. The root cause seems to be WSL2’s complicated networking setup interfering with multicast. Unfortunately I don’t really understand WSL2’s network setup. Is there some easy way to get mDNS resolving working?

    Comment I wrote with solution

    Setting dnsTunneling to false makes .local names work. I had to restart the WSL2 system as well.

    DNS tunneling is described here and the configuration file is described here. There’s also this bug report on GitHub.

    Before the change my WSL2 had 10.255.255.254 as the nameserver in the resolv.conf, now it has 172.31.0.1. DNS tunneling is relatively new (from Windows 11 22H2) and “is aimed to improve compatibility with VPNs, and other complex networking set ups”. So maybe turning it off causes other problems. But basic DNS and Tailscale’s Magic DNS are both still working for me and now .local names work too, so I’m happy.
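
    A quick way to check which state a WSL2 system is in, sketched in Python. This is a minimal sketch, not anything from the WSL docs: the function names are made up for illustration, and treating 10.255.255.254 as "the tunneling address" is an assumption based only on what I saw in my own resolv.conf.

```python
import socket

# Nameserver the post saw in resolv.conf while DNS tunneling was enabled.
# Treat it as an observation from one machine, not a documented constant.
TUNNEL_NAMESERVER = "10.255.255.254"


def nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines of resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def can_resolve(hostname):
    """True if the system resolver returns any address for hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def report(hostname="machine.local", path="/etc/resolv.conf"):
    """Print the configured nameservers and whether hostname resolves."""
    with open(path) as f:
        servers = nameservers(f.read())
    print("nameservers:", servers)
    print("tunneling address present:", TUNNEL_NAMESERVER in servers)
    print(hostname, "resolves:", can_resolve(hostname))
```

    Calling report() inside WSL2 before and after the change should show the nameserver switching and the .local lookup starting to succeed, assuming your setup behaves like mine.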

  • Setting up a new Proxmox server

    Some notes on setting up a new Proxmox server for the non-profit I’m helping. There’s almost nothing to do, got to be the simplest install ever. Still a few steps:

    Hardware tasks.

    1. Enable BIOS to boot on power restoration
    2. Quick run of memtest from Proxmox boot ISO
    3. Quick run of s-tui from Linux to heat test

    Basic Proxmox setup

    1. make a USB installer
    2. boot it and use the graphical install
    3. installer options:
      choose ZFS Raid0 for the simple filesystem on the SSD
      choose UTC timezone
      type a simple-to-remember root password
    4. after reboot, open web console and set root password to a strong password from 1Password

    Root shell customizations

    Make as few changes as possible on the PVE itself. But a few are useful, log in with ssh as root.

    1. Add an SSH key to ~root/.ssh/authorized_keys
    2. Run the tteck post install script and accept all its defaults
    3. Run the tteck microcode installer, I picked 3.20240514.1~deb12u1
    4. Install Tailscale
    5. Run pveam update to update the list of templates for containers

    GUI customizations

    1. node > System > DNS: update search domain and DNS servers
    2. node > local > CT Templates: pick templates for containers

  • Android Doze mode vs. notifications

    Still trying to understand why my fancy Google Pixel doesn’t show me notifications in a timely fashion. I got some new insight into a possible cause from some Reddit comments: Android Doze mode.

    Long story short, Doze is a power saving system where the phone only wakes up every 15-28 minutes to process things in what they call “a maintenance window”. Normal notifications get delayed too, although there is an option for high priority notifications from apps. There’s no way to configure it with the usual Android settings. There are ADB hacks you can apply (and have to re-apply every reboot): see here or here for detailed discussions and solutions. The key thing is the command dumpsys deviceidle disable.
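
    Since the hack has to be re-applied after every reboot, it is the kind of thing worth wrapping in a tiny script. A minimal sketch, assuming adb is on the PATH and exactly one device is connected with USB debugging enabled; the function names are hypothetical.

```python
import subprocess


def deviceidle_command(action):
    """Build the adb argv for 'dumpsys deviceidle <action>'."""
    if action not in ("disable", "enable"):
        raise ValueError("action must be 'disable' or 'enable'")
    return ["adb", "shell", "dumpsys", "deviceidle", action]


def set_doze(action):
    """Run the command on the connected device and return adb's output.

    Raises FileNotFoundError if adb is not installed and
    CalledProcessError if adb exits with an error.
    """
    result = subprocess.run(
        deviceidle_command(action),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

    set_doze("disable") is the equivalent of typing the command by hand; set_doze("enable") should put things back.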

    But Doze has been around a very long time, since Android 6. I only started noticing a problem around a year ago, maybe around Android 13 or 14. Did something change?

    I’ve got no problem with low power modes! But they can’t interfere with basic use of the device. It’s particularly vexing for me since my phone spends most of its time sitting on a Pixel Stand at 100% battery, power saving is not my problem. (Which suggests one possible change: Pixels have gotten recent changes in battery charge optimization that means the phone may not realize it has a source of charging power when it’s near 100%.)

    Another possible explanation for a change is that high priority messaging thing. It’s hard to imagine but maybe Gmail, Signal, and Messages aren’t getting high priority messages anymore and so don’t wake up? That’d be dumb. But maybe it’s a bug / bad system interaction?

    Update: Gmail is definitely not getting reliable updates. Twice today I only got updates when I woke the phone up manually. I have no way of knowing what priority Gmail sends its notifications at but it’d sure be dumb if it weren’t high. Gmail has a zillion options for configuring its notification behavior. Most of them were turned on but “Notify for every message” was off. I’ve turned it on, now to wait a few days to see if it makes a difference. (No immediate improvement, just had more mail not show up as notifications. Even after I woke the phone, still took 30 seconds or more.)

    Mostly I’m just tired and cranky about this bug. It’s affecting a huge number of Android users all over the world. It has to be obvious to any Google engineer who uses an Android phone regularly. Do they just think it’s OK? Can’t reproduce it? Don’t care?

    This is the point where I’d be the hero who does a lot of testing and figures out the problem himself, finds a fix. But fuck that, I’ve got better things to do and don’t really know enough to tackle this one myself. It’s particularly frustrating because it’s hard to reliably reproduce the problem, mostly I notice when I haven’t seen an email I know I got about once a day. Anyway if I do get the interest to tinker again, things to try:

    • the ADB hacks
    • turn off Google’s battery saving charging, see if it works better on the charging stand.

    Update: ntfy.sh

    Got some interesting perspective reading ntfy.sh’s docs.

    Instant delivery allows you to receive messages on your phone instantly, even when your phone is in doze mode, i.e. when the screen turns off, and you leave it on the desk for a while. This is achieved with a foreground service, which you’ll see as a permanent notification that looks like this: ….

    Limitations without instant delivery: Without instant delivery, messages may arrive with a significant delay (sometimes many minutes, or even hours later). If you’ve ever picked up your phone and suddenly had 10 messages that were sent long before you know what I’m talking about.

    The reason for this is Firebase Cloud Messaging (FCM). FCM is the only Google approved way to send push messages to Android devices, and it’s what pretty much all apps use to deliver push notifications. Firebase is overall pretty bad at delivering messages in time, but on Android, most apps are stuck with it.

    The ntfy Android app uses Firebase only for the main host ntfy.sh, and only in the Google Play flavor of the app. It won’t use Firebase for any self-hosted servers, and not at all in the F-Droid flavor.

  • Proxmox / KVM cpu types

    When you create a VM in Proxmox it gives you a bunch of options for what CPU to emulate. The right choice is probably “Host” if you’re not using Proxmox’ machine migration features.

    I’ve always chosen the default, x86-64-v2-AES QEMU, without knowing why. There’s a zillion options for various small differences in Intel and AMD CPUs, also QEMU virtual processors. Turns out the right setting for me is “Host”. That means no particular emulation, just pass through whatever the hardware has.

    It should be more performant but I’m not sure how much. I doubt the emulated CPUs are doing a full interpreter environment, but OTOH they talk about emulating microcode so it has to be at least somewhat complicated. This blog post has some measurements showing Host is 3-10% faster than KVM64.

    This is all well documented, see “CPU Type”. The important parts:

    Usually you should select for your VM a processor type which closely matches the CPU of the host system, as it means that the host CPU features (also called CPU flags ) will be available in your VMs. If you want an exact match, you can set the CPU type to host in which case the VM will have exactly the same CPU flags as your host system.

    Why isn’t this the default? Because it makes migration awkward.

    This has a downside though. If you want to do a live migration of VMs between different hosts, your VM might end up on a new system with a different CPU type or a different microcode version. If the CPU flags passed to the guest are missing, the QEMU process will stop. To remedy this QEMU has also its own virtual CPU types, that Proxmox VE uses by default. … If you don’t care about live migration or have a homogeneous cluster where all nodes have the same CPU and same microcode version, set the CPU type to host, as in theory this will give your guests maximum performance.

    With the QEMU CPU, /proc/cpuinfo in the VM shows “QEMU Virtual CPU version 2.5+”. With the CPU type set to “Host” it shows whatever the actual hardware has (ie, “Intel(R) N100”).
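
    To check what a guest actually sees without eyeballing the whole file, something like this works. It is a minimal sketch: the function name is made up, and it assumes the usual x86 /proc/cpuinfo layout with one “model name” line per core.

```python
def cpu_models(cpuinfo_text):
    """Return the distinct 'model name' values found in /proc/cpuinfo text."""
    models = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            model = value.strip()
            if model not in models:  # one entry per model, not per core
                models.append(model)
    return models
```

    Run inside the VM as cpu_models(open("/proc/cpuinfo").read()). A single QEMU entry means the guest is on an emulated type; the real hardware name means it is on Host.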

    I’ve apparently been running one VM as Host for months now with absolutely no problems. I converted the other one from QEMU to Host.

  • Linux network interfaces can have more than one IP address

    Had to re-learn this. Normally a network interface in Linux has a single IP address. But there’s nothing stopping you from adding several.

    Adding a second address is as simple as running something like ip addr add 192.168.3.254/24 dev eth0.

    Here’s a config I just set up in Alpine in /etc/network/interfaces. Not sure what the netplan equivalent looks like.

    auto eth0
    iface eth0 inet static
            address 192.168.9.5/24
            gateway 192.168.9.1
            hostname $(hostname)
    
    auto eth0:0
    iface eth0:0 inet static
            address 192.168.3.254/24

    More details in this Reddit post where I originally tried to make it much more complicated.
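
    To confirm both addresses really ended up on the interface, the JSON output of ip is easy to parse. A minimal sketch, assuming the full iproute2 ip command (the -j flag); the BusyBox ip on a stock Alpine install may not support it. Function names are hypothetical.

```python
import json
import subprocess


def interface_addresses(ip_json_text, ifname):
    """Return 'address/prefixlen' strings for ifname from `ip -j addr` output."""
    found = []
    for link in json.loads(ip_json_text):
        if link.get("ifname") != ifname:
            continue
        for info in link.get("addr_info", []):
            found.append(f"{info['local']}/{info['prefixlen']}")
    return found


def current_addresses(ifname="eth0"):
    """Ask iproute2 for the live addresses on ifname (needs `ip -j`)."""
    out = subprocess.run(
        ["ip", "-j", "addr", "show", "dev", ifname],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return interface_addresses(out, ifname)
```

    With the config above I would expect current_addresses("eth0") to list both 192.168.9.5/24 and 192.168.3.254/24, since the eth0:0 alias is just a label on a second address of eth0. IPv6 addresses on the interface show up in the same list.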

  • A good Claude programming example

    I’ve been having such a good time using Claude 3.5 Sonnet to help me program and learn APIs I’m not good at. See this transcript for an example, where Claude walks me through setting up a basic Python command line project, then adds pytest to it, then adds click, then makes it work in VS Code.

    I taught myself how to do all these tasks three months ago. But it was a tedious process of finding the right updated docs, then experimenting a lot, and in the end not quite getting it right. A typical pre-AI programming experience. It did only take me an hour or two, and thirty minutes the second time, but still it was tedious.

    Claude just magically gives me the answers. Here’s the catch: I’m not certain they’re right. I didn’t review this transcript closely or run the code. I did read it all though, and it is plausible. I am confident I could start with what Claude did for me, maybe fix a few little problems, and have a working thing. But it’s very important an expert programmer still be in the loop, you can’t just trust the code it emits. I’m fine with that, it saved me a lot of time.

    What’s remarkable is it gets so many core concepts I learned three months ago right:

    • the file structure with src/hello_world_cli/__init__.py and tests/
    • the contents of pyproject.toml
    • using optional dependencies for pytest, etc
    • the structure of a decent README.md
    • click’s decorators for CLI programs, including an example of a switch
    • the use of CliRunner to test click commands in pytest
    • the need for a __name__ block to make it run under the VS Code debugger
    • it even incremented the tool version number after making changes!

    It even taught me some new tricks:

    • using tool.pytest.ini_options to set up paths
    • using hatchling instead of setuptools
    • creating a launch.json file for .vscode

    I know that all that’s happening behind the scenes is a very big neural network is predicting token outputs. But what it looks like to me is Claude knows a lot about programming including the use of several complex open source tools. And I can mostly chat with Claude pretending it is intelligent and understands what I’m asking and it will give me useful answers.

    The limitation is I asked it precise questions about a relatively basic thing with a whole lot of good code examples and docs on the Internet. It wouldn’t do nearly as well with some more obscure libraries for which there are no good examples. It will have a harder time with more complex code. And it certainly won’t invent new things, Claude is not going to create a new algorithm for distributed consensus or something.

    But for basic stuff it sure is a lot better than doing Internet searches and reading through docs and example code myself. It makes programming way more fun for me by eliminating a lot of tedium.

  • Booting Windows 11 23H2 to Steam Big Picture

    I want my new TV gaming PC to boot all the way in to Steam Big Picture with no keyboard required. This is remarkably difficult and all the docs you find are outdated because Win 11 23H2 changed some things. Microsoft really doesn’t want you to do this. Partly for good reason (it removes local system security) and partly because they are trying to force you into their monopoly experience.

    This worked for me:

    1. Log in locally, not with Microsoft account

    Settings > Accounts > Your info has a “Sign in with a local account instead”. Turn that on.

    Documented here, at least until Microsoft hides the docs again. When I did this I got warned I needed to back up my Bitlocker encryption key first, I guess because ordinarily that’s stored in your Microsoft account. I just removed Bitlocker entirely instead.

    2. Re-enable the “no password” UI option

    Go in RegEdit to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\PasswordLess\Device and set the value DevicePasswordLessBuildVersion to 0. (Mine was 2).

    This is a new thing in 23H2, documented here. If you don’t, the next step won’t work.

    3. Enable passwordless login

    Run the program netplwiz. Uncheck the checkbox at the top for “Users must enter a user name and password to use this computer”.

    Documented here. This doesn’t work in 23H2 if you don’t do step 2 to re-enable it. Why the stupid program name? It started life as the “Add Network Places Wizard”, terminology from Windows 95.

    4. Get Steam to start at boot in Big Picture mode

    Go to Steam Settings > Interface. Enable “Run Steam when my computer starts” and “Start Steam in Big Picture Mode”.

    Documented here. The web has all sorts of complicated docs with registry edits, batch files, replacing the Windows shell, etc… None of that is necessary.

  • NFS still a tar pit in 2024

    I was dorking around with a new NFS server and ran mount testserver:/test /tmp/test on my main computer. Then I turned off the testserver VM and forgot about it. Classic mistake, one I’ve been making occasionally for over 30 years.

    The NFS disk in /tmp/test was still mounted on my client machine. And because I supplied no options to the mount command, it was a hard mount by default. Which means anything that accesses /tmp/test will get stuck in device wait, very hard to interrupt. I first noticed this because my restic backups were failing. It tried to back up /tmp/test and got stuck and the backup failed. (Why am I backing up /tmp? Lol.)

    The usual fix is to supply some options to the NFS mount so it’s interruptible or has a timeout. Some options to know of: soft, timeo, retrans, intr. My fstab mounts things with this cargo cult set of options: x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=10s. No idea where I got that, not sure it even actually works. Reading docs now I think that 10s timeout is for the initial mount only. And why does systemd need to be involved with its own bizarro thing? Who knows.
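
    For a throwaway test mount like mine, a soft mount with an explicit timeout would have turned the hang into an error. A minimal sketch of building that command: the function name and the default values (10 second timeout, 3 retries) are my own assumptions, not recommendations. Per nfs(5), timeo is in tenths of a second, and as far as I can tell intr is ignored on kernels newer than 2.6.25, so it is left out. Soft mounts carry their own risk: nfs(5) warns they can cause data corruption if a write times out.

```python
def nfs_mount_command(source, mountpoint, timeo_deciseconds=100, retrans=3):
    """Build a mount argv for a soft NFS mount with explicit timeout/retry.

    soft    -> fail the request with an error instead of retrying forever
    timeo   -> wait this many tenths of a second before retrying a request
    retrans -> number of retries before the request is failed
    """
    options = f"soft,timeo={timeo_deciseconds},retrans={retrans}"
    return ["mount", "-t", "nfs", "-o", options, source, mountpoint]
```

    nfs_mount_command("testserver:/test", "/tmp/test") gives the argv to hand to subprocess.run as root, or to read off as the options for an fstab entry.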

    I also don’t understand why NFS doesn’t have some basic keepalive / server liveness test in its client. Maybe because NFS was originally connectionless and sort of loosey-goosey? I thought NFSv4 made all that much better. Maybe there’s some way to configure it. But the default options on Linux in 2024 are a tar pit.

    I tried a bunch of things to unwedge my client. umount -f, which fails. lsof to see what processes had it open preventing the umount, which hangs forever. I finally gave up and just rebooted the NFS client. Awesome.

    The underlying problem is network resources are not local resources. Software likes to assume that commands to a hard drive will either succeed or fail, not just hang indefinitely. But with a network filesystem they absolutely can hang. Even with a local one: modern hard drives are complex devices.

  • First impression of NixOS

    I thought I’d give NixOS a spin after chatting with Jeff N. I had a great experience, I went from zero knowledge to a working web server in under two hours. And that includes lots of time writing this blog post, reading Mastodon, etc. And it was fun too!

    I got this done mostly by asking Claude to tell me what to do: who has patience to read docs? It’s fun to learn something new this way. Also using this Wiki page and this baby steps tutorial. Julia’s notes are good too.

    Update: I did this all as a Proxmox VM. There’s also notes on installing NixOS in a Proxmox LXC container.

    Some basic concepts

    I’m still trying to work my head around what Nix does. My impression is it’s similar to a Dockerfile or a puppet/ansible script, it’s a config file that creates a working Linux environment for stuff to run in. The difference is Nix has fancy design so everything’s managed, rollbacks are possible, etc.

    NixOS is not Nix. Nix is the package manager, NixOS is an operating system built using the Nix package manager. Nix is like apt, NixOS is like Ubuntu.

    The master configuration for NixOS and all the things running in it (like a web server) is in /etc/nixos/configuration.nix and has its own syntax. Almost the whole system is configured here. Nix folks have written little shims to take configuration directives from this file and write them into whatever weirdo format the ancient Unix utility might expect. (I hear the sendmail.cf generator is sentient but no one can understand what it’s saying.) You edit configuration.nix, check it into git, etc. You use nixos-rebuild switch to tell the system to implement the changes.

    I’m looking at NixOS for a whole system image but there’s also a way to use Nix for your personal user environment that’s broadly similar.

    There’s an effort to make bit-for-bit reproducible builds for NixOS which has nice affordances for security. But that’s not the only reason to use NixOS, more a happy side effect.

    Install experience

    First attempt for an install was with the minimal NixOS ISO (still 1GB!). But they aren’t kidding about minimal, this thing has you running fdisk yourself and leaves you with an unconfigured system. The graphical installer (3.5GBish) partitions a disk and gives you a basic working system when it’s done, so that’s easier.

    After the default install the system reboots. 2.7GB is consumed on the hard drive and there’s 89 packages (not counting documentation packages). There’s about 10 processes running other than the kernel and all its threads. They all run out of /nix/store. They include

    • systemd running journald, udevd, oomd, timesyncd, logind
    • dbus-daemon
    • nsncd: a shim for name service (this one?)
    • NetworkManager

    That’s a remarkably small system.

    /bin and /usr/bin are nearly empty. There’s about 10 directories in my $PATH but much of what I’d expect to find is in /run/current-system/sw/. All the binaries there are symlinks into /nix/store. They’ve been dynamically linked with libraries like libc referencing paths in /nix/store.
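
    The symlink claim is easy to check for any command on the $PATH. A minimal sketch with made-up function names; it only follows symlinks and compares paths, it knows nothing about how Nix itself tracks things.

```python
import os
import shutil


def resolves_into(path, store="/nix/store"):
    """True if path, with all symlinks followed, lives under store."""
    real = os.path.realpath(path)
    store_real = os.path.realpath(store)
    return os.path.commonpath([real, store_real]) == store_real


def command_in_store(name, store="/nix/store"):
    """True if `name` is found on $PATH and its real file is under store."""
    path = shutil.which(name)
    return path is not None and resolves_into(path, store)
```

    On the NixOS VM I would expect command_in_store("bash") to be True; on a conventional distro it should be False.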

    NixOS installs a fairly complete Linux environment, at least it has less, bash, rsync, ssh. No sshd, python, or vim though. So why 2.7GB? As near as I can tell there’s a lot of stuff in /nix/store that’s not in my current environment, like a whole install of python3.11. Those binaries seem to work if I run them directly but I’m guessing I’m supposed to configure the NixOS system to officially use them if I want them.

    The /nix/store can apparently get very large but there’s ways to manage it. I’m not sure if there’s an easy way to say “prune everything off the disk I don’t actually use” to get the install size down to 100MB or so.

    Installing some packages in NixOS

    First up, add an ssh server so I can log in easily. That’s as simple as editing configuration.nix to uncomment the line services.openssh.enable = true and then running nixos-rebuild switch to make it happen. The switch takes about 6 seconds. Various docs reference having to configure the firewall or enable password login, I didn’t have to do either with the NixOS GUI install’s defaults. (There’s some magic that allows ssh through the firewall if you enable it.)

    I tried a rollback to remove ssh again. Easy as running nixos-rebuild switch --rollback. Apparently nix is keeping a history of what it’s done that it can undo. The rollback kills the sshd so I can’t log in anymore. My active sshd was still running, so I guess it doesn’t kill everything. (That implies many things about just how much state Nix will alter. Files, not memory, is my guess.)

    Adding other shell utilities just requires listing them in environment.systemPackages. (Or maybe in a user environment, don’t understand that yet.) Whitespace delimited list. I did “joe git zip unzip” for now. And another rebuild switch and now I have those binaries available.

    Final project was a hello world web server. I had Claude do this part for me.

      # Enable Caddy web server
      services.caddy = {
        enable = true;
        configFile = pkgs.writeText "Caddyfile" ''
          :80 {
            root * /var/www
            file_server
          }
        '';
      };
    
      # Create a simple HTML file
      system.activationScripts.createWebRoot = ''
        mkdir -p /var/www
        echo "<html><body><h1>Hello World from NixOS!</h1></body></html>" > /var/www/index.html
      '';
    
      # Allow HTTP traffic
      networking.firewall.allowedTCPPorts = [ 80 ];

    That all just worked, thanks Claude! The firewall change was necessary: without it Caddy was running but unreachable outside this machine. iptables -L confirms that there is a firewall running (the default for NixOS), the chain nixos-fw seems most relevant. It’s letting in only ssh and http and ICMP ping.

    Things to learn about

    • Updating packages from network
    • Scripting creating a new NixOS system from the environment file
    • Per-user Nix config vs system-wide
    • Nix state tracking, how rollbacks work
    • Logging
    • Recommendations for running actual production services in NixOS
    • Pruning disk usage
    • The ecosystem of Nix packages
    • nix-shell
    • WTF a flake is

    Thoughts

    This experiment was a good experience. Over the years I’ve tried a bunch of ways to automate setting up Linux system environments and gotten very frustrated every time. NixOS I had up and running in an hour and I feel like I could use this for real.

    What would I use it for? Not sure, need to think more. I’ve been considering rolling my own NAS, a setup with NFS and Samba and syncthing and restic. I was just going to hand configure it but maybe NixOS would be a nice way to do that in a more managed way. As a VM in Proxmox I guess? Or maybe just an LXC container. Or just as a Nix thing in a general purpose VM, not sure if Nix gives me the separation I’d otherwise use Proxmox for.

    I’ve run into some usability warts. nix search does not work as the config file suggests, you have to do some complicated set of flags for “experimental features” (this is apparently fixable but is a bad experience for a newbie!) The configuration that the NixOS installer created throws various warnings when you build with it. But stuff mostly seems to work.

    The NixOS docs seem pretty good. I haven’t read them much but Claude has and that works great for queries. When I’ve looked stuff up myself there’s clear explanations for what I’m trying to learn.

    I love learning new stuff like this. Claude was a big help.

  • Another failed Ubuntu 23.04 upgrade

    Failed again to upgrade my recalcitrant Ubuntu 23.04 system. I give up and I’m mad at Ubuntu’s EOL policies.

    The fun thing is how much some new tools make this kind of thing easier and safer. I had a lot of confidence experimenting thanks to Proxmox and the ability to easily roll back the whole system or restore a backup. I also got a lot further than I thought I would using Claude. These AIs still have a problem that sometimes they confidently recommend bullshit. But as long as you understand that and evaluate suggestions yourself it’s a lot faster using an AI to find things than searching ordinary docs and forum posts. I tried more things more quickly than I would have without the AI help.

    Hacking around GRUB

    The problem last time is my system was installed with GRUB referencing a disk in /dev/disk/by-id that no longer exists. When I virtualized the server those IDs got lost. Claude did a great job suggesting various fixes. Some were wrong (like a hallucinated GRUB_DEVICE config) and some were plausible but don’t work (purge and reinstall GRUB). The good suggestion was to use debconf-set-selections to force a reconfig.

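    # Preseed an empty answer for GRUB's install-device question
    echo "grub-efi grub-efi/install_devices select" | sudo debconf-set-selections
    # Re-run the package configuration so GRUB picks a device again
    sudo dpkg-reconfigure grub-efi-amd64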

    This forced GRUB to reconfigure the path. It picked /dev/vda1. Not ideal (a UUID path would be better) but should work. And it did! I could reinstall GRUB on my 23.04 system, reboot, the BIOS all seemed happy. Success!

    Maybe. After far too much drama upgrading to 23.10 the system then didn’t reboot. It all looked like it worked but the BIOS was no longer happy with what it found for a boot partition. I chickened out at that point and rolled back, gave up.

    Ubuntu’s hostile EOL policy

    Even though I think I fixed GRUB, I once again failed in my attempt to upgrade Ubuntu.

    Ubuntu has a clear but obnoxious policy that outside LTS releases, their interim releases go end of life quickly. 23.04 went EOL on January 25, 2024. The entire lifespan of this release was 8 months and if you don’t upgrade quickly enough, too bad. I’m a little confused why it won’t still let me update to 23.10: it goes EOL in three days. But when I tried I was told “Your Ubuntu release is not supported anymore”. There’s no option to upgrade directly to 24.04 LTS either.

    I don’t mind that they EOL releases and don’t offer regular patches. But they really should offer some sort of upgrade path. They kind of do, I’ve upgraded an EOL system once before by hacking the sources.list to point at http://old-releases.ubuntu.com/. But there’s absolutely no tool help for doing that and you are on your own.
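
    The sources.list hack amounts to a search and replace on the mirror hostnames. A minimal sketch of that rewrite: the function name is made up, and the pattern assumes the stock archive.ubuntu.com, security.ubuntu.com and two-letter country mirrors, so any other mirror would need editing by hand. Back up the file before trying anything like this.

```python
import re

# Stock Ubuntu mirrors, optionally with a country prefix such as "us.".
_LIVE_MIRROR = re.compile(
    r"https?://(?:[a-z]{2}\.)?(?:archive|security)\.ubuntu\.com"
)


def point_at_old_releases(sources_text):
    """Rewrite stock mirror URLs in sources.list text to old-releases."""
    return _LIVE_MIRROR.sub("http://old-releases.ubuntu.com", sources_text)
```

    Feed it the contents of /etc/apt/sources.list and write the result back. After that, apt update should at least be able to find package lists for the EOL release again.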

    This time around I tried just swapping URLs and doing apt full-upgrade. It sure looked like it worked but then the reboot failed, maybe because of a bad update process or maybe because of my weird GRUB situation. Either way, I gave up.

    At this point I’m going to give up and use Ubuntu LTS releases only. Which means one release every two years. I guess I’ll use something with rolling releases (Arch or Debian testing) if I need a more updated OS.

    And now I’m doubly committed to rebuilding my main home Linux server. I kind of punted on that six months ago when I virtualized it into Proxmox as a whole system. I don’t relish rebuilding it all as lots of little containers but it’s probably a good idea. (Oh joy, now I can update 20 little Ubuntu systems instead of one big one!)