krjakbrjak’s Dev Notes

Bootstrapping Kubernetes Before the Registry Exists - Pre-Tagging Images for containerd

2026-05-31T00:00:00+00:00

If you’re setting up Kubernetes for a private project — internal tools, an isolated network, an in-house stack — at some point you hit the question: where do the images come from?

Every tutorial assumes public registries like docker.io, ghcr.io, or quay.io are reachable. When they aren’t, the chicken-and-egg starts. You can’t pull your registry image from your registry. You can’t authenticate against your IdP before the IdP is up. Each foundation service has the same shape.

There isn’t much written about how to actually bootstrap from this state. Here’s the approach I’ve been using.

Foundation services all have the same problem

The same pattern shows up everywhere:

Registry: kubelet needs to pull the registry image from somewhere, but the registry is what would serve it.
Identity provider: anything that does OIDC depends on the IdP being up — and the IdP pod doesn’t start without an image pull either.

If you treat each one as a special case, you end up with a pile of “first time only” scripts that drift out of sync with your normal deploy path.

A workable approach

The mechanic itself is plain:

Build the image with docker build.
Save it to a tarball with docker save.
Copy the tarball to every Kubernetes node.
Import it into containerd with ctr -n k8s.io images import.

With imagePullPolicy: IfNotPresent set on the pod spec, kubelet uses the cached image and doesn’t try to pull. The registry doesn’t have to be up. Nothing has to be reachable.

None of these steps are exotic. What matters is the name the image carries when you save it in step 2: whatever you tagged it with at build time is the name that ends up in the tarball.

The naming has to line up

When you import an image into containerd, it ends up in the cache under whatever name the tarball says. That name is just a string. There’s nothing special about repo/registry:latest vs registry.example.com/repo/registry:latest — both are valid, either can live in the cache, neither requires the registry hostname to resolve.

So the question is: which name?

The easy answer is the bare name that matches the chart defaults:

# values.yaml
image:
  repository: repo/registry
  tag: latest
  pullPolicy: IfNotPresent

docker build -t repo/registry:latest .
docker save repo/registry:latest > registry.tar
scp registry.tar node:/tmp/
ssh node 'sudo ctr -n k8s.io images import /tmp/registry.tar'

It works on day 1. But once the registry is up and you start pushing real builds like registry.example.com/repo/registry:v0.4.2, every chart needs its image.repository flipped to the fully-qualified path. Multiple charts, multiple environments, multiple overlays. Day-1 and day-2 deploys end up as different code paths.

The fix is to make the name you import under match the name your chart references and the name kubelet would dial out for. Three things, one string. Use the fully-qualified registry path from day 1:

docker build -t registry.example.com/repo/registry:v0.4.2 .
docker save registry.example.com/repo/registry:v0.4.2 > registry.tar
scp registry.tar node:/tmp/
ssh node 'sudo ctr -n k8s.io images import /tmp/registry.tar'

image:
  repository: registry.example.com/repo/registry
  tag: v0.4.2
  pullPolicy: IfNotPresent

Now containerd’s cache key, the chart’s image.repository, and the hostname kubelet would query on a cache miss are all the same string. The cluster can’t tell whether the image came from a side-load yesterday or a registry pull this morning.
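Here is a minimal Go sketch of that alignment check. The helper names (imageRef, registryHost) are made up for illustration, and the rule for spotting a registry host (first path component contains a dot or colon, or is localhost) is a simplification of how Docker-style references are usually parsed, not a full reference parser.

```go
package main

import (
	"fmt"
	"strings"
)

// imageRef joins a chart's image.repository and image.tag into the
// string kubelet will ask the container runtime for.
func imageRef(repository, tag string) string {
	return repository + ":" + tag
}

// registryHost returns the registry hostname embedded in a reference,
// or "" for a bare name such as "repo/registry:latest".
// Assumption: the first path component is a host only if it contains
// a "." or ":" or is exactly "localhost".
func registryHost(ref string) string {
	first, _, found := strings.Cut(ref, "/")
	if !found {
		return ""
	}
	if strings.ContainsAny(first, ".:") || first == "localhost" {
		return first
	}
	return ""
}

func main() {
	imported := "registry.example.com/repo/registry:v0.4.2" // name used for docker save
	chart := imageRef("registry.example.com/repo/registry", "v0.4.2")

	fmt.Println("names line up:", imported == chart)
	fmt.Println("registry host:", registryHost(chart))
}
```

A check like this could run in CI before the bootstrap script ships a tarball, so a mismatch between the saved name and the chart values is caught early.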

One caveat about the tag itself: use an immutable tag like v0.4.2, not latest. With IfNotPresent, kubelet keeps any image already in the cache and never re-pulls it — so a side-loaded latest stays frozen at the bootstrap build even after the registry is serving a newer latest. An immutable tag sidesteps this: the bootstrap nodes hold v0.4.2 forever (correctly), and the next release ships as v0.4.3, which those nodes have never seen and therefore pull normally once the registry is up.
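The tag caveat is easy to see in a toy model. The sketch below reduces IfNotPresent to "pull only when the exact reference is missing from the node's cache"; shouldPull and the cache map are illustrative names, not kubelet code.

```go
package main

import "fmt"

// shouldPull models imagePullPolicy: IfNotPresent in its simplest form:
// pull only when the exact reference is missing from the node's cache.
func shouldPull(cache map[string]bool, ref string) bool {
	return !cache[ref]
}

func main() {
	// Cache contents after a bootstrap side-load (illustrative).
	cache := map[string]bool{
		"registry.example.com/repo/registry:latest": true,
		"registry.example.com/repo/registry:v0.4.2": true,
	}

	for _, ref := range []string{
		"registry.example.com/repo/registry:latest", // cached: never refreshed
		"registry.example.com/repo/registry:v0.4.2", // cached: correct forever
		"registry.example.com/repo/registry:v0.4.3", // new release: not cached
	} {
		fmt.Printf("%-45s pull=%v\n", ref, shouldPull(cache, ref))
	}
}
```

Only the reference the node has never seen triggers a pull, which is why a new release needs a new tag.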

What this gives you

Every foundation service bootstraps the same way. Registry, IdP — same mechanic, same naming pattern, no special cases. The bootstrap script becomes a loop over a list of images.

Helm values are written once. No bootstrap-mode overlays versus production-mode overlays. No image.repository migration to track later. The values file you ship is the values file that stays correct.

Argo CD inherits the cluster cleanly. When you move to GitOps, Argo CD reads the same Helm charts. The image strings are unchanged — the side-loaded tags are already cached, and every new release bumps to a tag the nodes haven’t seen, so kubelet pulls it from the registry normally. No migration, no first-sync mode, no Application that has to know about the bootstrap path. Day-1 and day-2 are the same code path because the names line up.
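The "loop over a list of images" mentioned above might look like the following sketch. It only prints the commands from the earlier steps rather than running them; the node names, the second image, and the tarball naming scheme are assumptions for illustration.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// tarballName derives a file name from an image reference, e.g.
// "registry.example.com/repo/registry:v0.4.2" -> "registry_v0.4.2.tar".
func tarballName(ref string) string {
	base := path.Base(ref) // "registry:v0.4.2"
	return strings.ReplaceAll(base, ":", "_") + ".tar"
}

// sideLoadCommands returns the shell commands that save one image and
// import it into containerd on every node.
func sideLoadCommands(ref string, nodes []string) []string {
	tar := tarballName(ref)
	cmds := []string{fmt.Sprintf("docker save %s > %s", ref, tar)}
	for _, node := range nodes {
		cmds = append(cmds,
			fmt.Sprintf("scp %s %s:/tmp/", tar, node),
			fmt.Sprintf("ssh %s 'sudo ctr -n k8s.io images import /tmp/%s'", node, tar),
		)
	}
	return cmds
}

func main() {
	images := []string{
		"registry.example.com/repo/registry:v0.4.2",
		"registry.example.com/repo/idp:v1.0.0", // hypothetical IdP image
	}
	nodes := []string{"node1", "node2"} // hypothetical node names

	for _, ref := range images {
		for _, cmd := range sideLoadCommands(ref, nodes) {
			fmt.Println(cmd)
		}
	}
}
```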

Conclusion

The mechanic isn’t the point. docker save and ctr import are not clever. What matters is the alignment: the name in containerd, the name in the chart, and the name kubelet would dial out for — all the same string from the first import.

When they line up, the chicken-and-egg stops feeling like a problem.

When systemd-resolved Picks the Wrong DNS Server

2026-03-17T00:00:00+00:00

In a previous post, I described how I built a DNS forwarder for qcontroller — a tool that manages QEMU VM instances. The forwarder watches the host’s resolv.conf for changes and propagates upstream DNS servers to VMs transparently. It worked great — until I noticed that VMs occasionally failed to resolve private hostnames defined in the host’s /etc/hosts.

The Symptom

The setup was straightforward. Inside each VM, DHCP advertised three DNS servers: the gateway IP (pointing to the forwarder) plus 8.8.8.8 and 1.1.1.1 as fallbacks. From the host, querying the forwarder directly worked fine:

$ dig @192.168.71.1 myserver.internal.corp
;; ANSWER SECTION:
myserver.internal.corp.	0	IN	A	10.0.50.42

But from inside a VM:

$ dig myserver.internal.corp
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)

NXDOMAIN. The VM’s systemd-resolved returned a negative answer — even though the forwarder had the correct one. What was going on?

systemd-resolved Treats All DNS Servers as Equivalent

A quick resolvectl status inside the VM revealed the problem:

Current DNS Server: 1.1.1.1
       DNS Servers: 192.168.71.1 1.1.1.1 8.8.8.8

systemd-resolved had picked 1.1.1.1 as its active server — not the forwarder. And 1.1.1.1 knows nothing about my private /etc/hosts entries.

This is by design. From the systemd-resolved documentation:

The nss-dns resolver maintains little state between subsequent DNS queries, and for each query always talks to the first listed DNS server from /etc/resolv.conf first, and on failure continues with the next until reaching the end of the list which is when the query fails. The resolver in systemd-resolved however maintains state, and will continuously talk to the same server for all queries in a particular lookup scope until some form of error is seen at which point it will switch to the next server, and then stay with it for all queries on the scope until the next failure, and so on, eventually returning to the first configured server. This is done to optimize lookup times, in particular given that the resolver typically must first probe server feature sets when talking to a server, which takes time. This different behaviour implies that listed DNS servers per lookup scope must be equivalent in the zones they serve, so that sending a query to one of them will yield the same results as sending it to another configured DNS server.

In other words: all configured DNS servers within a scope are treated as interchangeable. systemd-resolved picks one, sticks with it, and only rotates on failure. If 1.1.1.1 responds (even with NXDOMAIN), that counts as “working” — so it never bothers trying the forwarder.

The relevant selection logic lives in resolved-dns-scope.c (dns_scope_get_dns_server()) and resolved-dns-server.c (manager_next_dns_server()).
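To make the "sticky" behaviour concrete, here is a deliberately simplified model in Go. It is not the code from those files: the types and names are invented, and the only rule it implements is the one quoted above (stay on the current server, rotate only on error, and treat NXDOMAIN as a valid answer). How my VM ended up on 1.1.1.1 in the first place is not something I traced; the sketch assumes a single transient failure of the forwarder.

```go
package main

import (
	"errors"
	"fmt"
)

// server answers a lookup. A nil error with found=false models NXDOMAIN:
// the server responded, it just has no record.
type server struct {
	name    string
	records map[string]string
	down    bool
}

var errUnreachable = errors.New("server unreachable")

func (s *server) lookup(host string) (addr string, found bool, err error) {
	if s.down {
		return "", false, errUnreachable
	}
	addr, found = s.records[host]
	return addr, found, nil
}

// stickyResolver keeps using the current server and moves to the next
// one only when a query fails outright.
type stickyResolver struct {
	servers []*server
	current int
}

func (r *stickyResolver) resolve(host string) (string, bool) {
	for range r.servers {
		addr, found, err := r.servers[r.current].lookup(host)
		if err == nil {
			return addr, found // NXDOMAIN is a valid answer: no rotation
		}
		r.current = (r.current + 1) % len(r.servers)
	}
	return "", false
}

func main() {
	forwarder := &server{name: "192.168.71.1", records: map[string]string{"myserver.internal.corp": "10.0.50.42"}}
	public := &server{name: "1.1.1.1", records: map[string]string{}}
	r := &stickyResolver{servers: []*server{forwarder, public}}

	// One transient failure of the forwarder moves the resolver to 1.1.1.1.
	forwarder.down = true
	r.resolve("example.org")
	forwarder.down = false

	_, found := r.resolve("myserver.internal.corp")
	fmt.Println("current server:", r.servers[r.current].name, "found:", found)
}
```

Once the model resolver has moved to the public server, every NXDOMAIN it returns counts as success, so the private name stays unresolvable even though the forwarder is healthy again.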

The Fix

The fix is straightforward: advertise only the forwarder’s IP via DHCP, so the VM’s systemd-resolved has no choice but to use it. No public servers in the mix means no wrong server to stick to.

Forwarding to systemd-resolved

But the previous forwarder design had a gap. As described in the earlier post, it read upstream servers from /run/systemd/resolve/resolv.conf — which contains the real upstream DNS servers (like 8.8.8.8), bypassing systemd-resolved entirely. That means the forwarder also bypassed everything systemd-resolved provides: /etc/hosts resolution, mDNS, split-DNS, VPN routing.

What if the forwarder just forwarded to 127.0.0.53 instead?

It turns out this is easy to do. As explained in the network namespaces post, the DNS forwarder runs in the root network namespace — it listens on the gateway IP (the host-side end of the veth pair), which is reachable from the VM namespace. Since it’s in the root namespace, it can talk to 127.0.0.53 directly.

VM query ──► gateway IP:53 (forwarder, root ns) ──► 127.0.0.53 (systemd-resolved)
                                                         │
                                                         ├── /etc/hosts
                                                         ├── /etc/resolv.conf
                                                         ├── mDNS
                                                         ├── VPN split-DNS
                                                         └── ...

The forwarder just needed one small extension — a WithUpstreams option that accepts static upstream addresses instead of reading from a file:

forwarder, err := dns.NewDNSFailoverForwarder(ctx,
    dns.WithForwarderAddress(gatewayIP),
    dns.WithForwarderTimeout(2*time.Second),
    dns.WithUpstreams([]string{"127.0.0.53:53"}),
)

When WithUpstreams is provided, the forwarder stores the addresses directly — no file watching, no fsnotify, no resolv.conf parsing. When it’s not provided, the existing behavior kicks in: watch resolv.conf and update upstreams dynamically.
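For readers unfamiliar with the functional-options pattern behind WithUpstreams, a minimal sketch follows. The config struct, the newConfig helper, and the default path are assumptions for illustration; qcontroller's real types may well differ.

```go
package main

import "fmt"

// config and Option are illustrative; the real forwarder's types may differ.
type config struct {
	upstreams      []string // static upstreams, if any
	resolvConfPath string   // file to watch when no static upstreams are set
}

type Option func(*config)

// WithUpstreams pins the forwarder to a fixed list of upstream addresses.
func WithUpstreams(addrs []string) Option {
	return func(c *config) {
		c.upstreams = append([]string(nil), addrs...) // copy to avoid aliasing
	}
}

// newConfig applies the options and reports whether the resolv.conf
// watcher is still needed.
func newConfig(opts ...Option) (cfg config, watchFile bool) {
	cfg = config{resolvConfPath: "/etc/resolv.conf"} // assumed default
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg, len(cfg.upstreams) == 0
}

func main() {
	cfg, watch := newConfig(WithUpstreams([]string{"127.0.0.53:53"}))
	fmt.Println(cfg.upstreams, "watch resolv.conf:", watch)

	_, watch = newConfig()
	fmt.Println("no options, watch resolv.conf:", watch)
}
```

The appeal of this shape is that existing callers that pass no option keep the old file-watching behaviour without any change.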

The configuration is modeled as a protobuf oneof, making the two modes mutually exclusive:

message Dns {
    string zone = 1;
    oneof upstream {
        string resolv_conf = 2;
        StaticUpstreams static = 3;
    }
}

When neither is set, the forwarder falls back to auto-detecting the resolv.conf path — preserving full backward compatibility.
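The three outcomes (static, resolv_conf, or auto-detect) can be sketched against the JSON form of this message. This is a standard-library approximation, not the generated protobuf code: the Go struct and the upstreamMode function are assumptions, and the "endpoints" key is taken from the JSON configuration used with the static mode.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// dnsConfig mirrors the JSON form of the Dns message. Pointer fields
// distinguish "not set" from "set to an empty value".
type dnsConfig struct {
	Zone       string  `json:"zone"`
	ResolvConf *string `json:"resolv_conf"`
	Static     *struct {
		Endpoints []string `json:"endpoints"`
	} `json:"static"`
}

// upstreamMode reports which branch of the oneof is in effect.
func upstreamMode(raw []byte) (string, error) {
	var cfg dnsConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return "", err
	}
	switch {
	case cfg.Static != nil && cfg.ResolvConf != nil:
		return "", errors.New("static and resolv_conf are mutually exclusive")
	case cfg.Static != nil:
		return "static", nil
	case cfg.ResolvConf != nil:
		return "resolv_conf", nil
	default:
		return "auto-detect", nil
	}
}

func main() {
	for _, raw := range []string{
		`{"zone": ".", "static": {"endpoints": ["127.0.0.53:53"]}}`,
		`{"zone": ".", "resolv_conf": "/etc/resolv.conf"}`,
		`{"zone": "."}`,
	} {
		mode, err := upstreamMode([]byte(raw))
		fmt.Println(mode, err)
	}
}
```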

Covering All Cases

This naturally leads to three deployment modes, each covering different environments:

systemd-resolved (most Linux desktops/servers): Use static upstreams pointing to 127.0.0.53. Gets /etc/hosts, mDNS, split-DNS, VPN — everything systemd-resolved handles.

"dns": {
    "zone": ".",
    "static": {
        "endpoints": ["127.0.0.53:53"]
    }
}

Non-systemd with resolv.conf: Use the dynamic resolv.conf watcher. The forwarder picks up upstream changes automatically.

"dns": {
    "zone": ".",
    "resolv_conf": "/etc/resolv.conf"
}

Non-systemd with CoreDNS: For environments where CoreDNS plugins are needed (e.g., the hosts plugin for /etc/hosts support), qcontroller also supports an embedded CoreDNS backend:

server, err := dns.NewCoreDNSServer(ctx,
    dns.WithForwarderAddress(gatewayIP),
    dns.WithResolvconfPath("/etc/resolv.conf"),
)

Conclusion

The root cause came down to a design assumption in systemd-resolved: all configured DNS servers must be equivalent. When some know about private resources and others don’t, things break in subtle, hard-to-debug ways.

The fix turned out to be small. The forwarder already ran in the root namespace, so 127.0.0.53 was right there. Adding a WithUpstreams option and a oneof in the protobuf schema was enough to make it work. VMs get full host DNS resolution — /etc/hosts, VPN, mDNS — without touching their configuration.

Giving Your AI the Right Context with Model Context Protocol (MCP)

2026-03-09T00:00:00+00:00

Nowadays, pretty much everyone works with AI one way or another. Whether it's writing code, debugging, or designing infrastructure, LLMs have pushed our productivity to another level. But here's the thing: to get the most out of them, we can help them by giving them better context. That's where the Model Context Protocol (MCP) comes in, and in this post I'll show how to build a simple MCP server in Go.

The Problem

Say you’re working on a backend. You have your data — maybe a database, maybe an API — and you want to think about what kind of interface or tooling you could build around it. You open your favorite AI assistant and start describing your data structures. “I have a books table with id, title, author, and a loans table that references…” — you get the idea. It works, but it’s tedious. You’re essentially doing the model’s homework.

Why not just let it look at the data directly?

For a small service without authentication, sure — you could point it at an endpoint and say “fetch some data, see its structure.” But imagine you’re working on something bigger. Something behind authentication layers, internal APIs, complex data relationships. You can’t just hand the model a URL and hope for the best.

Instead, you could build a small application that sits between the model and your backend — something that knows how to pull the data and explain its shape to the model. The model calls your app, your app talks to the backend, and the model gets exactly the context it needs.

You get the idea. If only there were a standard protocol for this…

What is Model Context Protocol (MCP)?

It’s called the Model Context Protocol (MCP), designed by Anthropic. And it’s exactly what you’d want.

MCP defines a standard way for AI models to discover and call external tools. Your app becomes an MCP server — it advertises what it can do (search, fetch, create, whatever), and the model calls those tools when it needs context. No more manual copy-pasting. No more describing your data schema in a chat window.

A Simple Example: Library Catalog

To see how this works in practice, I built a tiny MCP server in Go — a library catalog. Two tools: search books and get book details.

Here’s the MCP server configuration (.mcp.json), placed in the root of your workspace. This file defines MCP servers that provide additional context and capabilities to the AI client (e.g. Claude Code):

{
  "mcpServers": {
    "library": {
      "command": "./mcp-library",
      "args": []
    }
  }
}

And the server itself — using the official Go SDK:

server := mcp.NewServer(&mcp.Implementation{
    Name:    "library-mcp",
    Version: "1.0.0",
}, nil)

server.AddTool(
    &mcp.Tool{
        Name:        "search_books",
        Description: "Search the library catalog. Returns matching books.",
        InputSchema: json.RawMessage(`{
            "type": "object",
            "properties": {
                "title":  {"type": "string", "description": "Filter by book title."},
                "author": {"type": "string", "description": "Filter by author name."}
            }
        }`),
    },
    func(ctx context.Context, req *mcp.CallToolRequest) (*mcp.CallToolResult, error) {
        args := parseArgs(req)
        result, err := searchBooks(ctx, str(args, "title"), str(args, "author"))
        return textResult(result, err)
    },
)

server.AddTool(
    &mcp.Tool{
        Name:        "get_book",
        Description: "Get full details for a book by its ID, including loan history.",
        InputSchema: json.RawMessage(`{
            "type": "object",
            "properties": {
                "id": {"type": "string", "description": "The book ID."}
            },
            "required": ["id"]
        }`),
    },
    func(ctx context.Context, req *mcp.CallToolRequest) (*mcp.CallToolResult, error) {
        args := parseArgs(req)
        result, err := getBook(ctx, str(args, "id"))
        return textResult(result, err)
    },
)

if err := server.Run(context.Background(), &mcp.StdioTransport{}); err != nil {
    log.Fatal(err)
}

Each tool declares its name, description, and input schema — this is what the model sees. When the model decides it needs to search for books by Tolkien, it calls search_books with {"author": "Tolkien"}. The server does the lookup and returns the results. The model never had to be told what the data looks like — it discovered and queried it on its own.
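The helpers used in the handlers (parseArgs, str, textResult, searchBooks) aren't shown above. To give a feel for what happens behind a call, here is a standard-library-only sketch that decodes the params of a JSON-RPC "tools/call" request and runs a search over hardcoded data. It deliberately avoids the SDK; the catalog contents, the toolCall struct, and the dispatch function are all invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type book struct{ ID, Title, Author string }

// catalog is hardcoded sample data standing in for a real backend.
var catalog = []book{
	{"1", "The Hobbit", "J.R.R. Tolkien"},
	{"2", "Dune", "Frank Herbert"},
}

// toolCall is the params object of a "tools/call" request.
type toolCall struct {
	Name      string            `json:"name"`
	Arguments map[string]string `json:"arguments"`
}

// searchBooks returns books whose title and author contain the given
// filters (case-insensitive); an empty filter matches everything.
func searchBooks(title, author string) []book {
	var out []book
	for _, b := range catalog {
		if strings.Contains(strings.ToLower(b.Title), strings.ToLower(title)) &&
			strings.Contains(strings.ToLower(b.Author), strings.ToLower(author)) {
			out = append(out, b)
		}
	}
	return out
}

// dispatch decodes the request params and routes to the matching tool.
func dispatch(params []byte) ([]book, error) {
	var call toolCall
	if err := json.Unmarshal(params, &call); err != nil {
		return nil, err
	}
	if call.Name != "search_books" {
		return nil, fmt.Errorf("unknown tool %q", call.Name)
	}
	return searchBooks(call.Arguments["title"], call.Arguments["author"]), nil
}

func main() {
	books, err := dispatch([]byte(`{"name": "search_books", "arguments": {"author": "Tolkien"}}`))
	fmt.Println(books, err)
}
```

In the real server the SDK does the decoding and routing; the handler only has to turn arguments into a backend query and the result into text.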

What This Looks Like in Practice

With this MCP server running, I can open Claude Code in my project directory and just have a conversation:

Are there any books by Tolkien?

The model picks up the search_books tool, calls it with the right filter, and comes back with the results. No prompting gymnastics. No pasting JSON blobs. It just works.

And the best part — this is a trivial example with hardcoded data. Replace the mock data with actual database queries or API calls to your production backend, and you’ve got yourself an AI assistant that truly understands your system.

Conclusion

MCP bridges the gap between what the model can do and what it knows about your specific context. Instead of explaining your world to the model, you give it the tools to explore it. The protocol is open, the SDKs are available in multiple languages, and the integration with tools like Claude Code is already there.

If you’re building anything where an LLM could benefit from knowing your data — and let’s be honest, that’s most things these days — MCP is worth looking into.

The full source code for this example is available here.

Writing a BPF packet filter on macOS in Go

2026-02-19T00:00:00+00:00

 Without filter                 With BPF filter

  Network     Userspace          Network     Userspace
 ┌───────┐   ┌─────────┐       ┌───────┐   ┌─────────┐
 │  ARP  │──→│  ARP    │       │  ARP  │──→│  ARP    │
 │  IPv4 │──→│  IPv4   │       │  IPv4 │   │  reply  │
 │  ARP  │──→│  ARP    │       │  ARP  │   │         │
 │  IPv6 │──→│  IPv6   │       │  IPv6 │   │         │
 │  IPv4 │──→│  IPv4   │       │  IPv4 │   │         │
 │  ARP  │──→│  ARP    │       │  ARP  │   │         │
 │  ...  │──→│  ...    │       │  ...  │   │         │
 └───────┘   └─────────┘       └───────┘   └─────────┘
  ~10,000     ~10,000            ~10,000       ~100
  packets     copied             packets      copied

  App filters in userspace       Kernel filters before copy

The problem: discovering VM IP addresses without a guest agent

In a recent change to qcontroller, I removed the dependency on QEMU Guest Agent (QGA) for discovering a VM’s IP address. Previously, users had to install QGA inside every VM—easy enough with cloud-init, but still a hard requirement just to answer the question “what IP did this VM get?”

The alternative: ARP scanning. I already control the MAC addresses assigned to VMs, so I can periodically broadcast ARP requests on the virtual network interface and match the replies against known MACs. Pure Layer 2, no guest cooperation needed.

This post isn’t about the ARP scanner itself (that’s in PR #26). It’s about a problem I hit on macOS, and how six lines of BPF bytecode solved it. The BPF filter is implemented in PR #27.

Raw sockets on macOS: there aren’t any

On Linux, you open an AF_PACKET socket, bind it to an interface, and you’re reading raw Ethernet frames. macOS doesn’t support AF_PACKET. Instead, you go through BPF—Berkeley Packet Filter.

The setup looks roughly like this:

Open /dev/bpf0 (or /dev/bpf1, /dev/bpf2, … — you try them until one is available)
Bind it to a network interface with BIOCSETIF
Enable immediate mode with BIOCIMMEDIATE so reads return as soon as a packet arrives, rather than waiting for the buffer to fill
Optionally enable promiscuous mode with BIOCPROMISC
Read from the file descriptor—you get raw Ethernet frames, each prefixed by a bpf_hdr struct
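The "try them until one is available" step can be sketched as below. The probing callback is injectable so the loop can be exercised without real devices, and the cap of 16 devices is an arbitrary assumption. The ioctls from the later steps are omitted because they are macOS-specific.

```go
package main

import (
	"fmt"
	"os"
)

// firstFreeBPF walks /dev/bpf0 .. /dev/bpf{max-1} and returns the first
// path for which try reports success.
func firstFreeBPF(max int, try func(path string) bool) (string, bool) {
	for i := 0; i < max; i++ {
		path := fmt.Sprintf("/dev/bpf%d", i)
		if try(path) {
			return path, true
		}
	}
	return "", false
}

func main() {
	var dev *os.File
	path, ok := firstFreeBPF(16, func(p string) bool {
		f, err := os.OpenFile(p, os.O_RDWR, 0)
		if err != nil {
			return false // busy or missing: try the next one
		}
		dev = f
		return true
	})
	if !ok {
		fmt.Println("no free BPF device (expected on systems without /dev/bpf*)")
		return
	}
	defer dev.Close()
	fmt.Println("opened", path)
}
```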

This works. But there’s a catch.

The flood

Promiscuous mode means the BPF device captures everything on the wire—not just frames addressed to your MAC. On my home network, which has maybe a dozen devices, a few seconds of capture produced roughly:

~9,400 ARP frames (requests and replies from all devices)
~190 IPv4 frames
~50 IPv6 frames

That’s about 10,000 frames copied from kernel to userspace, where my Go code then checks each one: is it ARP? Is it a reply? Does the sender MAC match a VM I care about? For 99% of those frames, the answer is no.

On a busier network—an office, a data center—this gets much worse. We’re doing an O(n) scan of the entire network’s chatter to find the handful of ARP replies we actually need. The kernel already has all these frames in its buffers; we’re just making it copy them all to us so we can throw most away.

BPF is more than a packet source

Here’s the coolest thing about BPF: it’s not just a mechanism for reading packets. It includes a programmable filter that runs inside the kernel, before packets are copied to userspace. The “F” in BPF stands for Filter, and that filter is the interesting part.

BPF defines a small virtual machine with:

Two registers: A (accumulator) and X (index), both 32-bit
A small instruction set: load, store, jump, arithmetic, return

The VM operates on the raw packet data. Instructions can load bytes from specific offsets in the packet, compare them, and either accept or reject the packet. The kernel runs this program on every incoming frame. Only frames that pass the filter get copied to userspace.

This is the same mechanism that powers tcpdump expressions. When you write tcpdump arp, tcpdump compiles that into BPF bytecode and installs it via BIOCSETF. We can do the same thing.

The Ethernet frame layout

To write a BPF filter, you need to know exactly what bytes you’re looking at. An Ethernet frame carrying an ARP message is 42 bytes:

Ethernet header (14 bytes):
  [0:6]   Destination MAC (broadcast: ff:ff:ff:ff:ff:ff)
  [6:12]  Source MAC
  [12:14] EtherType         ← 0x0806 means ARP

ARP payload (28 bytes):
  [14:16] Hardware type      (1 = Ethernet)
  [16:18] Protocol type      (0x0800 = IPv4)
  [18]    Hardware addr len   (6)
  [19]    Protocol addr len   (4)
  [20:22] Operation          ← 1 = request, 2 = reply
  [22:28] Sender MAC
  [28:32] Sender IP
  [32:38] Target MAC
  [38:42] Target IP

Two fields matter for filtering:

Byte offset 12 (2 bytes): the EtherType. If it’s not 0x0806, this isn’t ARP—drop it.
Byte offset 20 (2 bytes): the ARP opcode. If it’s not 0x0002, this isn’t a reply—drop it.
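The layout translates directly into a small frame builder, which is handy for tests. This is a sketch rather than qcontroller's own helper: it always uses a broadcast destination and a zero target MAC, which is accurate for requests and good enough for exercising a filter. Addresses are plain byte slices, so net.HardwareAddr and 4-byte net.IP values can be passed as they are.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	etherTypeARP = 0x0806
	arpRequest   = 1
	arpReply     = 2
)

// buildARPFrame lays out a 42-byte Ethernet+ARP frame using the offsets
// from the layout above. senderMAC must be 6 bytes, the IPs 4 bytes.
func buildARPFrame(op uint16, senderMAC, senderIP, targetIP []byte) []byte {
	f := make([]byte, 42)
	copy(f[0:6], []byte{0xff, 0xff, 0xff, 0xff, 0xff, 0xff}) // broadcast destination
	copy(f[6:12], senderMAC)
	binary.BigEndian.PutUint16(f[12:14], etherTypeARP)
	binary.BigEndian.PutUint16(f[14:16], 1)      // hardware type: Ethernet
	binary.BigEndian.PutUint16(f[16:18], 0x0800) // protocol type: IPv4
	f[18], f[19] = 6, 4                          // hardware and protocol address lengths
	binary.BigEndian.PutUint16(f[20:22], op)
	copy(f[22:28], senderMAC)
	copy(f[28:32], senderIP)
	// f[32:38] stays zero: target MAC
	copy(f[38:42], targetIP)
	return f
}

// isARPReply performs the two checks that matter for filtering.
func isARPReply(f []byte) bool {
	return len(f) >= 22 &&
		binary.BigEndian.Uint16(f[12:14]) == etherTypeARP &&
		binary.BigEndian.Uint16(f[20:22]) == arpReply
}

func main() {
	mac := []byte{0x11, 0x22, 0x33, 0x44, 0x55, 0x66}
	reply := buildARPFrame(arpReply, mac, []byte{10, 0, 0, 1}, []byte{10, 0, 0, 2})
	request := buildARPFrame(arpRequest, mac, []byte{10, 0, 0, 1}, []byte{10, 0, 0, 2})
	fmt.Println(len(reply), isARPReply(reply), isARPReply(request))
}
```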

The filter: six instructions

Here’s the complete BPF program using the BPF_STMT/BPF_JUMP macros from the bpf(4) man page:

BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),            // A = halfword at offset 12 (EtherType)
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x0806, 0, 3), // if A == 0x0806 (ARP) continue, else skip 3 to drop
BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),            // A = halfword at offset 20 (ARP opcode)
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x0002, 0, 1), // if A == 0x0002 (reply) continue, else skip 1 to drop
BPF_STMT(BPF_RET+BPF_K, 0xFFFFFFFF),            // ACCEPT: return entire packet
BPF_STMT(BPF_RET+BPF_K, 0),                     // DROP: return 0 bytes (discard)

BPF_STMT(code, k) encodes a non-branching instruction. BPF_JUMP(code, k, jt, jf) encodes a conditional branch where jt and jf are the number of instructions to skip forward on true/false. The code field is built by combining a class (BPF_LD, BPF_JMP, BPF_RET), a size (BPF_H for halfword—2 bytes), and an addressing mode (BPF_ABS for absolute packet offset, BPF_K for constant).

A BPF_RET instruction tells the kernel how many bytes of the packet to copy to userspace. Returning 0xFFFFFFFF (the maximum uint32) means “copy the entire packet.” Returning 0 means “copy nothing”—i.e., drop the packet.

BPF_JUMP takes two skip counts: jt (jump true) and jf (jump false). A skip of 0 means “don’t skip, just execute the next instruction”—sometimes called falling through. A skip of 3 means “skip the next 3 instructions.”
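If the macro arithmetic feels opaque, it helps to see the numbers. The sketch below re-creates BPF_STMT and BPF_JUMP as Go functions and prints the encoded program. The constant values are the classic BPF opcode components from net/bpf.h; the Go names (insn, stmt, jump) are mine.

```go
package main

import "fmt"

// Opcode components from <net/bpf.h>.
const (
	bpfLD  = 0x00 // class: load into A
	bpfJMP = 0x05 // class: jump
	bpfRET = 0x06 // class: return
	bpfH   = 0x08 // size: halfword (2 bytes)
	bpfABS = 0x20 // mode: absolute packet offset
	bpfJEQ = 0x10 // jump if A == k
	bpfK   = 0x00 // operand: constant k
)

// insn mirrors struct bpf_insn.
type insn struct {
	Code   uint16
	Jt, Jf uint8
	K      uint32
}

// stmt and jump are Go equivalents of the BPF_STMT and BPF_JUMP macros.
func stmt(code uint16, k uint32) insn { return insn{Code: code, K: k} }

func jump(code uint16, k uint32, jt, jf uint8) insn {
	return insn{Code: code, Jt: jt, Jf: jf, K: k}
}

var arpReplyProgram = []insn{
	stmt(bpfLD+bpfH+bpfABS, 12),
	jump(bpfJMP+bpfJEQ+bpfK, 0x0806, 0, 3),
	stmt(bpfLD+bpfH+bpfABS, 20),
	jump(bpfJMP+bpfJEQ+bpfK, 0x0002, 0, 1),
	stmt(bpfRET+bpfK, 0xFFFFFFFF),
	stmt(bpfRET+bpfK, 0),
}

func main() {
	for i, in := range arpReplyProgram {
		fmt.Printf("(%03d) code=0x%02x jt=%d jf=%d k=0x%x\n", i, in.Code, in.Jt, in.Jf, in.K)
	}
}
```

The load encodes as 0x28, the conditional jump as 0x15, and the return as 0x06, which are the same opcodes tcpdump prints with its -dd flag.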

Let’s trace through what happens for different packets:

An ARP reply arrives. BPF_LD loads bytes [12:14] into A: 0x0806. BPF_JEQ compares against 0x0806: match, jt=0, so we fall through. Next BPF_LD loads bytes [20:22]: 0x0002. BPF_JEQ compares against 0x0002: match, fall through. BPF_RET returns 0xFFFFFFFF—the kernel copies the full packet to userspace.

An ARP request arrives. Same path through the first three instructions, but bytes [20:22] contain 0x0001 (request, not reply). BPF_JEQ: no match, jf=1, skip 1 instruction forward—past the accept—landing on BPF_RET returning 0. Packet dropped. Never reaches userspace.

An IPv4 packet arrives. BPF_LD loads bytes [12:14]: 0x0800. BPF_JEQ against 0x0806: no match, jf=3, skip 3 instructions forward, landing directly on the drop. A load, a jump, and the return, and it's done. The kernel never even looks at the ARP opcode field.

Most traffic on a network is IPv4/IPv6, and it gets rejected after just two instructions—a load and a conditional jump. The kernel doesn’t copy a single byte to userspace for those packets.
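These traces can be reproduced with a toy interpreter. The sketch below supports only the three opcodes this filter uses and counts executed instructions, including the final return. It is a teaching aid written for this post, not the kernel's implementation, and the frame helper fills in only the two fields the filter reads.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

type insn struct {
	Code   uint16
	Jt, Jf uint8
	K      uint32
}

// Only the three opcodes the ARP filter uses.
const (
	opLdHAbs = 0x28 // BPF_LD+BPF_H+BPF_ABS
	opJeqK   = 0x15 // BPF_JMP+BPF_JEQ+BPF_K
	opRetK   = 0x06 // BPF_RET+BPF_K
)

var arpReplyProgram = []insn{
	{opLdHAbs, 0, 0, 12},
	{opJeqK, 0, 3, 0x0806},
	{opLdHAbs, 0, 0, 20},
	{opJeqK, 0, 1, 0x0002},
	{opRetK, 0, 0, 0xFFFFFFFF},
	{opRetK, 0, 0, 0},
}

// run interprets prog against pkt and returns the verdict (bytes to
// copy) plus the number of instructions executed.
func run(prog []insn, pkt []byte) (verdict uint32, steps int) {
	var a uint32 // accumulator
	for pc := 0; pc < len(prog); pc++ {
		in := prog[pc]
		steps++
		switch in.Code {
		case opLdHAbs:
			if int(in.K)+2 > len(pkt) {
				return 0, steps // out-of-bounds load rejects the packet
			}
			a = uint32(binary.BigEndian.Uint16(pkt[in.K:]))
		case opJeqK:
			if a == in.K {
				pc += int(in.Jt) // the loop's pc++ adds the usual +1
			} else {
				pc += int(in.Jf)
			}
		case opRetK:
			return in.K, steps
		}
	}
	return 0, steps
}

// frame builds a zeroed 42-byte frame with the given EtherType at
// offset 12 and the given value at offset 20.
func frame(etherType, op uint16) []byte {
	f := make([]byte, 42)
	binary.BigEndian.PutUint16(f[12:], etherType)
	binary.BigEndian.PutUint16(f[20:], op)
	return f
}

func main() {
	for _, c := range []struct {
		name string
		pkt  []byte
	}{
		{"ARP reply", frame(0x0806, 2)},
		{"ARP request", frame(0x0806, 1)},
		{"IPv4", frame(0x0800, 0)},
	} {
		v, steps := run(arpReplyProgram, c.pkt)
		fmt.Printf("%-12s accepted=%v instructions=%d\n", c.name, v != 0, steps)
	}
}
```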

Writing it in Go

Go’s syscall package has BpfStmt and BpfJump functions for constructing BPF instructions, but they’re deprecated. The recommended replacement is golang.org/x/net/bpf, which provides typed instruction structs:

var arpReplyFilter = []bpf.Instruction{
    bpf.LoadAbsolute{Off: 12, Size: 2},                          // BPF_LD+BPF_H+BPF_ABS  k=12
    bpf.JumpIf{Cond: bpf.JumpEqual, Val: 0x0806, SkipFalse: 3}, // BPF_JMP+BPF_JEQ+BPF_K k=0x0806 jt=0 jf=3
    bpf.LoadAbsolute{Off: 20, Size: 2},                          // BPF_LD+BPF_H+BPF_ABS  k=20
    bpf.JumpIf{Cond: bpf.JumpEqual, Val: 0x0002, SkipFalse: 1}, // BPF_JMP+BPF_JEQ+BPF_K k=0x0002 jt=0 jf=1
    bpf.RetConstant{Val: 0xFFFFFFFF},                            // BPF_RET+BPF_K          k=0xFFFFFFFF
    bpf.RetConstant{Val: 0},                                     // BPF_RET+BPF_K          k=0
}

Each Go struct maps directly to a BPF instruction. LoadAbsolute{Off: 12, Size: 2} is BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12)—load a halfword (2 bytes) from absolute packet offset 12. JumpIf{Cond: bpf.JumpEqual, Val: 0x0806, SkipFalse: 3} is BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 0x0806, 0, 3)—SkipFalse: 3 means “if not equal, skip 3 instructions forward” (landing on the final BPF_RET that drops the packet).

The bpf.Assemble function compiles these typed instructions into raw bytecode ([]bpf.RawInstruction). But here's where it gets interesting: golang.org/x/net/bpf doesn't provide a function to install the filter on a macOS BPF device. Linux sockets are covered (SO_ATTACH_FILTER, for example through SetBPF in golang.org/x/net/ipv4), but the macOS BIOCSETF ioctl needs a syscall.BpfProgram struct pointing to syscall.BpfInsn values. Fortunately, bpf.RawInstruction and syscall.BpfInsn have identical memory layouts: a uint16 opcode (named Op in one and Code in the other) followed by Jt uint8, Jf uint8, and K uint32. So an unsafe.Pointer cast works:

func setBPFFilterARPReply(fd int) error {
    raw, err := bpf.Assemble(arpReplyFilter)
    if err != nil {
        return fmt.Errorf("failed to assemble BPF filter: %w", err)
    }

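    if len(raw) == 0 {
        return fmt.Errorf("empty BPF program") // guards the &raw[0] below
    }

    // bpf.RawInstruction and syscall.BpfInsn share a layout, so the
    // assembled slice can be handed to the kernel without copying.
    prog := syscall.BpfProgram{
        Len:   uint32(len(raw)),
        Insns: (*syscall.BpfInsn)(unsafe.Pointer(&raw[0])),
    }
    // Install the filter on the BPF device.
    _, _, errno := syscall.Syscall(
        syscall.SYS_IOCTL,
        uintptr(fd),
        syscall.BIOCSETF,
        uintptr(unsafe.Pointer(&prog)),
    )
    if errno != 0 {
        return fmt.Errorf("BIOCSETF failed: %w", errno)
    }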
    return nil
}
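Since the cast relies on the two structs agreeing byte for byte, it can be worth asserting that in a test. The sketch below uses local mirror types rather than the real bpf.RawInstruction and syscall.BpfInsn (the latter only exists on BSD-like platforms), so treat the type definitions as assumptions that restate the field layout described above.

```go
package main

import (
	"fmt"
	"unsafe"
)

// rawInstruction mirrors the field layout of bpf.RawInstruction.
type rawInstruction struct {
	Op     uint16
	Jt, Jf uint8
	K      uint32
}

// bpfInsn mirrors the field layout of struct bpf_insn.
type bpfInsn struct {
	Code uint16
	Jt   uint8
	Jf   uint8
	K    uint32
}

// sameLayout reports whether the two structs agree on size and on the
// offset of every field, which is what the pointer cast relies on.
func sameLayout() bool {
	var r rawInstruction
	var b bpfInsn
	return unsafe.Sizeof(r) == unsafe.Sizeof(b) &&
		unsafe.Offsetof(r.Op) == unsafe.Offsetof(b.Code) &&
		unsafe.Offsetof(r.Jt) == unsafe.Offsetof(b.Jt) &&
		unsafe.Offsetof(r.Jf) == unsafe.Offsetof(b.Jf) &&
		unsafe.Offsetof(r.K) == unsafe.Offsetof(b.K)
}

// reinterpret views a slice of one type as the other without copying.
func reinterpret(raw []rawInstruction) []bpfInsn {
	if len(raw) == 0 {
		return nil
	}
	return unsafe.Slice((*bpfInsn)(unsafe.Pointer(&raw[0])), len(raw))
}

func main() {
	fmt.Println("same layout:", sameLayout())
	insns := reinterpret([]rawInstruction{{Op: 0x28, K: 12}})
	fmt.Printf("%+v\n", insns[0])
}
```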

Testing without hardware

One of the nice things about golang.org/x/net/bpf is that it includes bpf.NewVM, a userspace BPF interpreter. You can feed it your filter program and run arbitrary byte slices through it to verify the accept/drop logic without opening any devices or network interfaces:

func TestARPReplyFilter_DropsARPRequest(t *testing.T) {
    vm, err := bpf.NewVM(arpReplyFilter)
    require.NoError(t, err)

    frame := buildARPRequest(
        net.HardwareAddr{0x11, 0x22, 0x33, 0x44, 0x55, 0x66},
        net.IP{10, 0, 0, 1},
        net.IP{10, 0, 0, 2},
    )
    verdict, err := vm.Run(frame)
    require.NoError(t, err)
    assert.Zero(t, verdict, "ARP request should be dropped")
}

vm.Run returns the number of bytes the filter would accept. Zero means drop. This makes BPF filter logic fully unit-testable—no root privileges, no network interfaces, no platform dependencies.

The result

Before the filter, with debug logging enabled to count frames:

Received frame: 0x0806
Received frame: 0x0800
Received frame: 0x0806
Received frame: 0x86dd
Received frame: 0x0800
...
(~10,000 frames in a few seconds)

0x0806 is ARP, 0x0800 is IPv4, 0x86dd is IPv6—all mixed together, all copied to userspace.

After installing the filter:

Received frame: 0x0806
Received frame: 0x0806
Received frame: 0x0806
...
(~100 frames in a few seconds)

Only 0x0806. Only ARP replies. A ~100x reduction in packets reaching userspace, achieved by six instructions running in the kernel. The CPU and memory cost of processing those extra 9,900 frames per scan cycle is simply gone.

Beyond ARP: other things you can filter

The same pattern applies any time you want to isolate a specific type of traffic. A BPF filter is just a sequence of field checks at fixed byte offsets — once you know the layout of the packet you’re after, writing the filter is mechanical. A few examples:

HTTP/HTTPS traffic (custom sniffer for a specific service). Three layers: EtherType 0x0800 at offset 12, IP protocol 0x06 (TCP) at offset 23, TCP destination port at offset 36. Matching two ports requires two JumpIf instructions — the first jumps to accept on port 80, the second drops anything that isn’t 443:

var httpFilter = []bpf.Instruction{
    bpf.LoadAbsolute{Off: 12, Size: 2},
    bpf.JumpIf{Cond: bpf.JumpEqual, Val: 0x0800, SkipFalse: 6}, // IPv4?
    bpf.LoadAbsolute{Off: 23, Size: 1},
    bpf.JumpIf{Cond: bpf.JumpEqual, Val: 0x06, SkipFalse: 4},   // TCP?
    bpf.LoadAbsolute{Off: 36, Size: 2},                          // dst port
    bpf.JumpIf{Cond: bpf.JumpEqual, Val: 80, SkipTrue: 1},      // port 80 → accept
    bpf.JumpIf{Cond: bpf.JumpEqual, Val: 443, SkipFalse: 1},    // port 443 → accept
    bpf.RetConstant{Val: 0xFFFFFFFF},
    bpf.RetConstant{Val: 0},
}

This assumes a standard 20-byte IP header. It also only matches the destination port — outgoing requests. To catch responses too, add the same OR check against the source port at offset 34.
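Here is one way the source-port extension could look. To keep the example runnable with the standard library alone, it swaps golang.org/x/net/bpf for raw instructions and a tiny local interpreter; the opcode constants are the classic BPF values, everything else (names, frame helper) is illustrative. The same 20-byte IP header assumption applies.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

type insn struct {
	Code   uint16
	Jt, Jf uint8
	K      uint32
}

const (
	ldH = 0x28 // BPF_LD+BPF_H+BPF_ABS
	ldB = 0x30 // BPF_LD+BPF_B+BPF_ABS
	jeq = 0x15 // BPF_JMP+BPF_JEQ+BPF_K
	ret = 0x06 // BPF_RET+BPF_K
)

// httpBothWays accepts TCP segments whose destination OR source port is
// 80 or 443. Jump target = own index + 1 + skip.
var httpBothWays = []insn{
	{ldH, 0, 0, 12},         //  0: EtherType
	{jeq, 0, 9, 0x0800},     //  1: not IPv4 -> drop (11)
	{ldB, 0, 0, 23},         //  2: IP protocol
	{jeq, 0, 7, 0x06},       //  3: not TCP -> drop (11)
	{ldH, 0, 0, 36},         //  4: destination port
	{jeq, 4, 0, 80},         //  5: match -> accept (10)
	{jeq, 3, 0, 443},        //  6: match -> accept (10)
	{ldH, 0, 0, 34},         //  7: source port
	{jeq, 1, 0, 80},         //  8: match -> accept (10)
	{jeq, 0, 1, 443},        //  9: match falls through, else drop (11)
	{ret, 0, 0, 0xFFFFFFFF}, // 10: accept
	{ret, 0, 0, 0},          // 11: drop
}

// run is a minimal interpreter for the four opcodes used above.
func run(prog []insn, pkt []byte) uint32 {
	var a uint32
	for pc := 0; pc < len(prog); pc++ {
		in := prog[pc]
		switch in.Code {
		case ldH:
			if int(in.K)+2 > len(pkt) {
				return 0
			}
			a = uint32(binary.BigEndian.Uint16(pkt[in.K:]))
		case ldB:
			if int(in.K) >= len(pkt) {
				return 0
			}
			a = uint32(pkt[in.K])
		case jeq:
			if a == in.K {
				pc += int(in.Jt)
			} else {
				pc += int(in.Jf)
			}
		case ret:
			return in.K
		}
	}
	return 0
}

// tcpFrame builds a minimal Ethernet+IPv4+TCP frame with the given ports.
func tcpFrame(srcPort, dstPort uint16) []byte {
	f := make([]byte, 54) // 14 Ethernet + 20 IPv4 + 20 TCP
	binary.BigEndian.PutUint16(f[12:], 0x0800)
	f[14] = 0x45 // IPv4, header length 5 words
	f[23] = 0x06 // TCP
	binary.BigEndian.PutUint16(f[34:], srcPort)
	binary.BigEndian.PutUint16(f[36:], dstPort)
	return f
}

func main() {
	fmt.Println("request to 443:", run(httpBothWays, tcpFrame(50000, 443)) != 0)
	fmt.Println("response from 80:", run(httpBothWays, tcpFrame(80, 50000)) != 0)
	fmt.Println("ssh:", run(httpBothWays, tcpFrame(50000, 22)) != 0)
}
```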

ICMP only (ping traffic, latency tooling). Check EtherType 0x0800 at offset 12, then load the IP protocol byte at offset 23 and compare to 0x01. Two checks — done.

DNS (queries and replies). EtherType 0x0800 at offset 12, IP protocol 0x11 (UDP) at offset 23, then the 2-byte UDP destination port at offset 36 equal to 0x0035 (53). Those three checks catch queries; to catch replies as well, also accept source port 53 at offset 34. Everything else is gone before it reaches your code.

DHCP (watching address assignments on a local network). Same shape as DNS — EtherType 0x0800, UDP — but match destination port 0x0043 (67, server) or 0x0044 (68, client).

Traffic from a specific MAC address. The source MAC sits at offsets 6–11 in the Ethernet header. Load 4 bytes at offset 6, compare to the upper 32 bits of the target MAC; load 2 bytes at offset 10, compare to the lower 16 bits. Two checks, no IP layer involved.

The principle is always the same: find the fixed-offset fields that uniquely identify the traffic you want, put the most common rejection first, and jump to the drop on mismatch. The kernel handles the rest.
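For the MAC-address case, the only fiddly part is splitting six bytes into a 32-bit and a 16-bit comparison value. A small sketch of that arithmetic, with a userspace version of the same two comparisons; the function names and the example MAC are made up.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// macWords splits a 6-byte MAC into the two values a classic BPF filter
// would compare against: the first four bytes as a 32-bit word and the
// last two as a 16-bit halfword, both big-endian like packet loads.
func macWords(mac [6]byte) (hi uint32, lo uint16) {
	return binary.BigEndian.Uint32(mac[0:4]), binary.BigEndian.Uint16(mac[4:6])
}

// fromMAC applies the same two comparisons to a frame in userspace.
// The source MAC occupies bytes 6..11 of the Ethernet header.
func fromMAC(frame []byte, mac [6]byte) bool {
	if len(frame) < 12 {
		return false
	}
	hi, lo := macWords(mac)
	return binary.BigEndian.Uint32(frame[6:10]) == hi &&
		binary.BigEndian.Uint16(frame[10:12]) == lo
}

func main() {
	mac := [6]byte{0x52, 0x54, 0x00, 0x12, 0x34, 0x56} // example MAC
	hi, lo := macWords(mac)
	fmt.Printf("load 4 bytes at offset 6, compare to 0x%08x\n", hi)
	fmt.Printf("load 2 bytes at offset 10, compare to 0x%04x\n", lo)
}
```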

Takeaway

If you’re doing any kind of raw packet capture on macOS through /dev/bpf*, installing a filter is straightforward and the performance difference is dramatic. Six instructions, two conditional checks, and the kernel does the work for you.

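One constraint worth knowing: a classic BPF filter on macOS is read-only. It can accept, truncate, or drop a frame, but it cannot modify one in flight. (Writing to the BPF device sends a new frame; it does not alter captured traffic.) If rewriting packets is a requirement, you'll need a different approach.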

Solving Keycloak Internal vs External Access in Kubernetes with hostname-backchannel-dynamic

2026-02-16T00:00:00+00:00

Introduction

Using OpenID Connect (OIDC) as an authentication source is a widely recommended practice when working with infrastructure, as it improves both security and maintainability. Keycloak is an excellent open-source project widely adopted for this purpose. It supports many features and storage backends (such as PostgreSQL), and its official website has straightforward deployment instructions.

However, I recently encountered an interesting challenge when deploying Keycloak in Kubernetes that required a specific configuration to solve internal service communication issues.

The Problem: External Hostname vs Internal Access

When deploying Keycloak in Kubernetes, you typically specify a public hostname using the --hostname=https://auth.example.com parameter. This works perfectly for external clients accessing your authentication service.

But here’s where it gets tricky: imagine you have other services running in your Kubernetes cluster—perhaps a container registry or CI server—that need to authenticate with Keycloak. These services need to access the discovery URL at https://auth.example.com/realms/{realm-name}/.well-known/openid-configuration to retrieve authentication configuration.

The issue arises because Keycloak internally always redirects to (and generates tokens/URLs based on) the hostname that was specified during deployment. But what happens when this public URL is not resolvable by pods inside the Kubernetes cluster? This creates a problem where internal services can’t properly reach Keycloak for backchannel requests (token introspection, userinfo, etc.), even if they can reach the pod via internal DNS.

The Solution: Dynamic Backchannel Hostname

Fortunately, Keycloak provides a CLI option to address this exact issue (available when the hostname:v2 feature is enabled):

--features=hostname:v2
--hostname-backchannel-dynamic=true

This configuration tells Keycloak to dynamically determine the backchannel (internal) URLs based on the incoming request, allowing access via:

Direct IP addresses
Internal Kubernetes DNS (e.g., keycloak.keycloak-namespace.svc.cluster.local:8080/realms/{realm-name}/.well-known/openid-configuration)
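To make the two access paths concrete, here is a small Go sketch that builds the discovery URL for the same realm from either base address. It only assembles strings following the /realms/{realm-name}/.well-known/openid-configuration pattern; the realm name "demo" and the plain-HTTP scheme on the internal address are assumptions for illustration, and discoveryURL is a hypothetical helper rather than part of any Keycloak client library.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// discoveryURL joins a Keycloak base URL and a realm name into the
// OIDC discovery endpoint for that realm.
func discoveryURL(base, realm string) string {
	return strings.TrimRight(base, "/") +
		"/realms/" + url.PathEscape(realm) +
		"/.well-known/openid-configuration"
}

func main() {
	// Public hostname, as used by clients outside the cluster.
	fmt.Println(discoveryURL("https://auth.example.com", "demo"))

	// Cluster-internal service DNS name, as used by pods (assumed HTTP on 8080).
	fmt.Println(discoveryURL("http://keycloak.keycloak-namespace.svc.cluster.local:8080", "demo"))
}
```

An in-cluster service would be configured with the second URL, while browsers and external clients keep using the first.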

How It Works

With --hostname-backchannel-dynamic=true enabled:

External Access: Clients outside the cluster use the public hostname (https://auth.example.com) for authentication flows.
Internal Access: Services within the cluster can use the internal Kubernetes service DNS name to communicate directly with Keycloak pods.

This dual-access approach ensures that:

External clients get the proper public URL for authentication flows
Internal services can reliably reach Keycloak using cluster-internal DNS resolution
No complex network routing or additional ingress configuration is needed just for internal communication

Note for production: For this to work securely, make sure your ingress / reverse proxy correctly passes Forwarded or X-Forwarded-* headers, and consider enabling HTTPS on both external and internal access paths.

Example Configuration

Here’s how you might configure this in a Kubernetes deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: keycloak
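spec:
  selector:
    matchLabels:
      app: keycloak                        # placeholder label; must match the template labels below
  template:
    metadata:
      labels:
        app: keycloak
    spec: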
      containers:
      - name: keycloak
        image: quay.io/keycloak/keycloak:latest
        args:
        - start
        - --features=hostname:v2           # required for dynamic backchannel
        - --hostname=https://auth.example.com
        - --hostname-backchannel-dynamic=true
        - --db=postgres
        - --proxy-headers=forwarded        # important for correct header handling behind proxy/ingress
        # ... other configuration (ports, HTTPS, DB credentials via env vars, etc.)

Conclusion

The --hostname-backchannel-dynamic=true flag (combined with the hostname:v2 feature) is a simple yet powerful solution for mixed internal/external access scenarios in Kubernetes. While the public URL remains ideal for external client access, internal service-to-service communication often requires this flexibility.

Keycloak’s hostname configuration options make it a robust choice for authentication infrastructure in containerized environments.

Building a Simple DNS Forwarder for VMs in Go

2026-01-30T00:00:00+00:00

Learn how to build a smart DNS forwarder in Go for QEMU VMs managed by qcontroller. Automatically sync host DNS (including VPN changes) using fsnotify, miekg/dns, and CoreDHCP — without touching running guest configurations.

Introduction: Why DNS “Just Works” … Until It Doesn’t

On modern Linux systems, systemd-resolved handles DNS resolution transparently — you rarely need to think about it. It simply works. But when managing QEMU-based virtual machines with qcontroller, things get more interesting. qcontroller supports two main ways to configure networking and DNS for VM instances:

DHCP (default fallback)
Cloud-Init network configuration

When Cloud-Init’s network config is not used, the instance falls back to DHCP. As explained in the previous post, qcontroller runs the QEMU process inside a dedicated network namespace connected to the host’s root namespace via a veth pair. This namespace isolation is powerful: port 53 (DNS) is free inside the namespace, so we can run our own DHCP and DNS services without conflicts.

For DHCP, I use the excellent, modular CoreDHCP server, embedded and running in a separate goroutine. One of its key configuration fields is the DNS server IP (DHCP clients always query DNS on port 53). I simply pass the nameserver IPs from the QEMU subcommand configuration:

    "linuxSettings": {
        "network": {
            "name": "br0",
            "gateway_ip": "192.168.71.1/24",
            "bridge_ip": "192.168.71.3/24",
            "dhcp": {
                "start": "192.168.71.4/24",
                "end": "192.168.71.254/24",
                "lease_time": 86400,
                "dns": ["8.8.8.8", "8.8.4.4"],
                "lease_file": "./build/run/qcontroller-dhcp-leases"
            },
            "start_dns": true
        }
    }

This configuration will start the internal DNS server and use the IPs specified in the dns field as fallback DNS resolvers.

When static IPs are preferred, you can provide Cloud-Init network config with dedicated nameservers. This setup is reliable: start the VM, and everything configures itself automatically. I thought my work was done — until I connected the host to a VPN. Suddenly, DNS resolution for resources in the VPN subnet stopped working inside the VMs.

The Two Core Problems

Detecting host DNS changes (e.g., new VPN nameservers added to the host)
Propagating those changes to running VMs without disrupting or compromising guest services

Touching running VMs directly is dangerous — a mistake could break critical services. We need a safer approach.

Solution Part 1: Detecting Host DNS Changes Reliably

On Linux, nameservers are traditionally listed in /etc/resolv.conf. But on systemd-based systems, /etc/resolv.conf is usually a symlink to a stub file pointing to 127.0.0.53 (systemd-resolved’s local resolver). The real upstream servers are managed elsewhere.

The correct location is:

/run/systemd/resolve/resolv.conf (on systemd systems)
/etc/resolv.conf (fallback for non-systemd setups)

Network namespaces isolate the network stack, not the filesystem, so qcontroller can still read these host files even though it runs in a separate network namespace. Polling the file works but wastes resources. Better: watch for changes using filesystem notifications. In Go, the battle-tested fsnotify library handles this well. For maximum reliability (especially with atomic renames, where the file is replaced rather than rewritten in place), watch the parent directory (/run/systemd/resolve/ or /etc/) instead of the file itself. This captures creates, removes, and modifications cleanly.

Solution Part 2: Parsing resolv.conf Without Reinventing the Wheel

Once a change is detected, parse the file to extract upstream servers. Parsing resolv.conf manually is doable but error-prone and best avoided. Instead, use the mature miekg/dns library — the de-facto standard DNS toolkit in Go. It includes built-in parsers:

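import (
    "net"

    "github.com/miekg/dns"
)

// loadUpstreams returns the host's upstream resolvers as host:port strings.
// (The helper name is illustrative.)
func loadUpstreams() []string {
    upstreams := []string{}
    cfg, cfgErr := dns.ClientConfigFromFile("/run/systemd/resolve/resolv.conf")
    if cfgErr != nil {
        // fallback to /etc/resolv.conf
        cfg, cfgErr = dns.ClientConfigFromFile("/etc/resolv.conf")
    }

    if cfgErr == nil {
        for _, server := range cfg.Servers {
            upstreams = append(upstreams, net.JoinHostPort(server, cfg.Port))
        }
    }

    // upstreams now contains the upstream addresses (empty if neither file could be read)
    return upstreams
}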

With fsnotify + miekg/dns, we reliably detect and load updated upstreams from the host.

Solution Part 3: Static DNS in VMs + Smart Forwarding

Instead of dynamically reconfiguring VMs (risky!), give every VM a single, static DNS resolver IP — the address of our embedded DNS server inside the namespace. But how can one static resolver handle host DNS changes (VPNs, etc.)? Enter a custom DNS forwarder:

Listens on port 53 in the VM namespace
Forwards queries sequentially to the current upstream list (from host resolv.conf)
Returns immediately on the first positive response (NOERROR + answers > 0)
Otherwise continues to the next upstream
Falls back to the last negative response (e.g. NXDOMAIN or NODATA)
Returns SERVFAIL only if all upstreams fail completely (network errors)

This “optimistic fallback until positive” logic is simple yet powerful — it mirrors real-world needs like VPN + public DNS chaining. The full implementation lives in qcontroller — see the latest changes.
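The selection rule can be sketched in a few lines of Go. This is not the qcontroller implementation: to keep it self-contained, the network exchange is hidden behind a hypothetical queryFunc, and reply is a simplified stand-in for a real DNS message. Only the decision logic (first positive answer wins, otherwise the last negative answer, otherwise SERVFAIL) is taken from the list above.

```go
package main

import (
	"errors"
	"fmt"
)

// DNS response codes used by the sketch.
const (
	rcodeNoError  = 0
	rcodeServFail = 2
	rcodeNXDomain = 3
)

// reply is a simplified stand-in for a DNS response message.
type reply struct {
	Rcode   int
	Answers int // number of records in the answer section
}

// queryFunc sends one question to one upstream (hypothetical abstraction).
type queryFunc func(upstream string) (reply, error)

// resolve tries upstreams in order and returns the first positive reply.
// If none is positive it returns the last negative reply, and SERVFAIL
// only when every upstream failed at the network level.
func resolve(upstreams []string, query queryFunc) reply {
	var lastNegative *reply
	for _, up := range upstreams {
		r, err := query(up)
		if err != nil {
			continue // network error: try the next upstream
		}
		if r.Rcode == rcodeNoError && r.Answers > 0 {
			return r // positive answer: stop here
		}
		neg := r
		lastNegative = &neg // NXDOMAIN, NODATA, ...: remember and keep going
	}
	if lastNegative != nil {
		return *lastNegative
	}
	return reply{Rcode: rcodeServFail}
}

// fakeQuery simulates upstreams from a table; unknown names are unreachable.
func fakeQuery(table map[string]reply) queryFunc {
	return func(upstream string) (reply, error) {
		r, ok := table[upstream]
		if !ok {
			return reply{}, errors.New("upstream unreachable: " + upstream)
		}
		return r, nil
	}
}

func main() {
	// A public resolver that does not know the name, and a VPN resolver that does.
	table := map[string]reply{
		"public": {Rcode: rcodeNXDomain},
		"vpn":    {Rcode: rcodeNoError, Answers: 1},
	}
	fmt.Printf("%+v\n", resolve([]string{"public", "vpn"}, fakeQuery(table)))
	fmt.Printf("%+v\n", resolve([]string{"public", "offline"}, fakeQuery(table)))
	fmt.Printf("%+v\n", resolve([]string{"offline"}, fakeQuery(table)))
}
```

Keeping the last negative reply rather than the first is a design choice: the client sees whatever the final reachable upstream said, which is usually the most specific resolver in a VPN-plus-public chain.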

Fallback for Resilience

What happens if qcontroller crashes (hopefully not the case!) or stops? VMs keep running, but DNS updates from the host stop. To handle this gracefully, configure a fallback nameserver list in the QEMU config (e.g., 8.8.8.8, 1.1.1.1, 9.9.9.9). VMs then fall back to public DNS — not ideal for internal/VPN resources, but better than total failure.

Conclusion

With this setup:

VMs always use a single, static DNS IP
The embedded forwarder dynamically follows host DNS changes (including VPN connections)
No guest reconfiguration needed → zero risk to running services
Reliable detection via fsnotify + robust parsing via miekg/dns
Graceful fallback via configurable public resolvers

Your VMs now resolve names the same way the host’s root namespace does, automatically.

Enjoy hassle-free DNS in your VM fleet!

From Swagger UI to React: Building qcontroller’s Frontend

2026-01-08T00:00:00+00:00

In previous articles, I introduced qcontroller, a powerful tool for managing the complete lifecycle of QEMU VM instances—creating, starting, stopping, and removing VMs with database-like operations.

While qcontroller’s REST API worked well for automation, and Swagger UI provided basic interaction capabilities, the growing adoption revealed a critical pain point: managing VMs through Swagger UI was becoming increasingly tedious for daily operations. What started as a backend-focused project clearly needed a proper frontend.

I built the qcontroller UI—a React-based web interface that transforms VM management from a technical chore into an intuitive experience. After spending considerable time on infrastructure and backend development, returning to frontend work was a refreshing change that reminded me why I love building user-facing applications.

TL;DR

Built a React frontend for qcontroller to replace cumbersome Swagger UI. Key highlights:

Tech stack: React + TypeScript + Mantine + Vite for modern, maintainable development
Real-time updates: WebSocket integration for live VM status changes and IP allocation
Code generation: OpenAPI Generator for REST client + Protocol Buffers for WebSocket messages
Single binary distribution: Go’s embed directive bundles the entire React app into the executable
Result: Users download one file and get both API and web interface with zero setup

The Challenge: Beyond Basic CRUD Operations

The UI requirements seemed straightforward at first glance, but the devil was in the details. VM management operations naturally split into two domains:

VM Image Management:

Upload custom VM images (crucial for development workflows)
List available images with metadata
Remove unused images to save storage

VM Instance Lifecycle:

Create instances with complex configuration options
Start, stop, and delete VMs
Monitor real-time status changes
Track resource allocation (IP addresses, ports, etc.)

The real complexity emerged from the parameters involved. Creating a VM isn’t just clicking “start”—it involves networking configurations, resource allocation, storage options, and more. Each operation needed a thoughtful UI that could handle this complexity without overwhelming users.

The Game Changer: Real-Time Updates

The most critical missing piece was live feedback. In the Swagger UI world, you’d make a request and manually refresh to see status changes. But VM operations are inherently asynchronous—starting a VM takes time, IP allocation happens dynamically, and status changes occur continuously.

This drove me to implement WebSocket-based event streaming in qcontroller itself. Now the UI could show real-time updates as VMs boot up, IP addresses get assigned, and operations complete. This single feature transformed the user experience from static and frustrating to dynamic and responsive.

Tech Stack Decisions: Modern Tools for Modern Problems

Choosing the right frontend stack was crucial for both development speed and long-term maintainability.

React + TypeScript: The obvious choice for component-based UI development. React’s virtual DOM model and extensive ecosystem made it perfect for building dynamic interfaces.

Mantine: After evaluating several component libraries, Mantine stood out for its high-quality, responsive components and excellent developer experience. Every component looked professional out of the box—crucial for a developer tool that needed to feel polished.

Vite: Modern build tooling that feels lightning-fast compared to Webpack. The development server starts instantly, and hot module replacement actually works reliably.

The real elegance came from React’s Context API for handling WebSocket connections. Instead of prop drilling or complex state management, the entire app could reactively update from a single WebSocket stream:

import { useEffect, useState } from 'react';
import { UpdatesContext } from '@/common/updates-context';

export function UpdatesProvider({ children, wsUrl }) {
  const [data, setData] = useState(null);

  useEffect(() => {
    const ws = new WebSocket(wsUrl);
    ws.binaryType = 'arraybuffer';

    ws.onopen = () => {
      // Implementation
    };

    ws.onmessage = (event) => {
      // Implementation
    };

    return () => {
      if (ws.readyState === WebSocket.OPEN) {
        ws.close();
      }
    };
  }, [wsUrl]);

  return (
    <UpdatesContext.Provider value={data}>{children}</UpdatesContext.Provider>
  );
}

And then, in your app entry point:

import React from 'react';
import ReactDOM from 'react-dom/client';
import { UpdatesProvider } from '@/common/updates-provider';

ReactDOM.createRoot(document.getElementById('root')!).render(
  <React.StrictMode>
    <UpdatesProvider wsUrl="/ws">
      {/* App content */}
    </UpdatesProvider>
  </React.StrictMode>
);

Code Generation: The API-First Approach

For the REST API communication, I leveraged OpenAPI Generator to automatically generate TypeScript client code. This API-first approach eliminates the common frontend-backend synchronization problems and ensures type safety across the entire stack.

The async nature of VM operations presented an interesting challenge. While OpenAPI excels at describing synchronous REST operations, there’s no standard way to describe WebSocket-based event streams. AsyncAPI exists but didn’t fit my specific needs, and I wanted to avoid the complexity of gRPC-Web proxies.

The solution was surprisingly elegant: using Protocol Buffers for WebSocket messages. With ts-proto, the WebSocket message handling became as type-safe as the REST API, with everything generated from .proto definitions. Only a few lines of WebSocket connection code needed to be written manually—the rest was generated and type-safe.

The Deployment Game-Changer: Single Binary with Embedded UI

One of the most compelling aspects of this project turned out to be the deployment strategy. For qcontroller’s specific use case—a tool that gets distributed as a standalone binary—this approach was a perfect match.

qcontroller is written in Go, which already provides excellent deployment characteristics: compile once, run anywhere, no runtime dependencies. Since the tool is designed to be downloaded and run directly by users, maintaining that simplicity was crucial. But how do you include a modern React application without breaking this elegant distribution model?

For most web applications, you’d have separate frontend and backend deployments, CDNs for static assets, or containerized solutions. But qcontroller needed to stay true to its “single binary” philosophy for easy adoption and maintenance.

Go’s embed directive provided the perfect solution for this specific requirement—the entire React build becomes part of the binary itself:

package frontend

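import (
  "embed"
  "io/fs"
  "net/http"
  "path"
  "strings"
)

//go:embed generated/*
var webFS embed.FS

// Handler serves the embedded React build below basepath.
func Handler(basepath string) http.HandlerFunc {
  return func(w http.ResponseWriter, r *http.Request) {
    // TrimPrefix avoids the out-of-range panic that slicing would cause
    // when the request path is shorter than basepath.
    name := path.Join("generated", strings.TrimPrefix(r.URL.Path, basepath))
    // Stat instead of Open, so no file handle is left unclosed.
    if info, err := fs.Stat(webFS, name); err != nil || info.IsDir() {
      // Serve index.html for client-side routing
      http.ServeFileFS(w, r, webFS, "generated/index.html")
      return
    }
    http.ServeFileFS(w, r, webFS, name)
  }
}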

For qcontroller’s distribution model, this delivers exactly what’s needed: users download one binary, run it, and immediately get both the API and UI. No configuration files, no separate setup steps, no version mismatches between frontend and backend components.

The maintenance benefits are significant too. There’s no need to coordinate releases between multiple services, no asset versioning concerns, and no deployment complexity. Users always get a perfectly matched frontend and backend in a single download.

Results: From Functional to Delightful

The transformation from Swagger UI to a custom React interface has been remarkable. What was once a series of API calls requiring manual status checks is now an intuitive dashboard with real-time updates. VM creation involves guided forms instead of raw JSON, and operations provide immediate visual feedback.

The development experience reinforced something I’ve always believed: when you choose the right tools, frontend development can be just as systematic and maintainable as backend work. The combination of TypeScript, code generation, and well-designed component libraries created a development workflow that felt as robust as my usual Go projects.

The qcontroller UI proves that developer tools don’t have to sacrifice usability for power. With the right architecture and toolchain, you can build interfaces that are both technically sophisticated and genuinely pleasant to use.

Network Namespaces: Isolating VM Networking

2025-11-29T00:00:00+00:00

In my previous articles, I discussed various networking approaches for Linux virtualization. I developed qcontroller, a tool responsible for managing the complete lifecycle of QEMU VM instances—creating, starting, stopping, and removing VMs with database-like operations.

Since modern VMs typically require internet access and inter-VM communication, qcontroller also manages firewall settings using nftables rules. The original networking scheme involved creating bridges, configuring nftables chains, and establishing rules to allow traffic flow between the internet, VMs, and host system. Each VM connects through a TAP device that uses the bridge as its master interface.

While this approach works well, it has a significant drawback: all networking components—bridges, TAP devices, and nftables rules—exist within the host’s network stack. This “pollution” of the host networking requires careful cleanup to avoid breaking the host system when removing VMs. Each interface and rule must be individually and properly removed.

I prefer solutions where removing a single component automatically cleans up everything else. Fortunately, Linux provides exactly this capability through network namespaces. Let’s explore how network namespaces can help build a cleaner, more isolated solution for managing VM networking.

What are Network Namespaces?

Most developers familiar with Docker have encountered the concept of namespaces, particularly network namespaces. This Linux kernel feature allows you to create isolated network stacks on the same physical host, each appearing as a completely separate network environment. According to the Linux manual pages:

Network namespaces provide isolation of the system resources associated with networking: network devices, IPv4 and IPv6 protocol stacks, IP routing tables, firewall rules, the /proc/net directory (which is a symbolic link to /proc/pid/net), the /sys/class/net directory, various files under /proc/sys/net, port numbers (sockets), and so on. In addition, network namespaces isolate the UNIX domain abstract socket namespace (see unix(7)).

This is exactly what we need—a completely separate network stack with its own devices, routing tables, and firewall rules. However, when you create a new network namespace, it starts empty with no network devices. So how do we connect it to the internet? The Linux manual explains the solution:

A virtual network (veth(4)) device pair provides a pipe-like abstraction that can be used to create tunnels between network namespaces, and can be used to create a bridge to a physical network device in another namespace. When a namespace is freed, the veth(4) devices that it contains are destroyed.

The key insight here is the automatic cleanup: when a namespace is deleted, all its contained veth devices are automatically destroyed—exactly the behavior we want!

Creating and Configuring a Network Namespace

Since our host network stack has internet connectivity, we need to connect our new namespace to the host network using a veth pair (which acts like a virtual ethernet cable). For the pair to communicate, both ends need IP addresses. Here are the commands to set this up:

# Create a new network namespace called 'example'
sudo ip netns add example

# Create a veth pair (virtual ethernet cable)
sudo ip link add host-veth type veth peer name example-veth

# Move one end of the veth pair into the new namespace
# (initially both ends exist in the host namespace)
sudo ip link set example-veth netns example

# Assign IP addresses to both ends of the veth pair
sudo ip addr add 192.168.26.1/24 dev host-veth              # Host end
sudo ip netns exec example ip addr add 192.168.26.2/24 dev example-veth  # Namespace end

# Bring both interfaces up
sudo ip link set dev host-veth up
sudo ip netns exec example ip link set dev example-veth up

After executing these commands, we have successfully configured a new network namespace and connected it to the host namespace via a veth pair. Let’s test the connectivity with sudo ip netns exec example ping 192.168.26.1:

PING 192.168.26.1 (192.168.26.1) 56(84) bytes of data.
64 bytes from 192.168.26.1: icmp_seq=1 ttl=64 time=0.038 ms
64 bytes from 192.168.26.1: icmp_seq=2 ttl=64 time=0.073 ms
64 bytes from 192.168.26.1: icmp_seq=3 ttl=64 time=0.070 ms

Excellent! The connection works. Notice that network devices belonging to different namespaces are isolated from each other (try running ip a in both namespaces to see this separation).
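A tool that automates these steps needs them as data rather than as a shell transcript. The following Go sketch builds the same command sequence as a dry-run plan. It is an illustration under assumptions, not how qcontroller does it: vethPlan and its methods are hypothetical names, and a real implementation would hand each entry to os/exec or talk to the kernel over netlink instead of printing.

```go
package main

import (
	"fmt"
	"strings"
)

// vethPlan describes one namespace connected to the host by a veth pair.
type vethPlan struct {
	Namespace string // e.g. "example"
	HostIf    string // veth end that stays in the host namespace
	NsIf      string // veth end moved into the namespace
	HostCIDR  string // address for the host end
	NsCIDR    string // address for the namespace end
}

// commands returns the ip(8) invocations in the order they must run.
func (p vethPlan) commands() [][]string {
	inNS := func(args ...string) []string {
		return append([]string{"ip", "netns", "exec", p.Namespace}, args...)
	}
	return [][]string{
		{"ip", "netns", "add", p.Namespace},
		{"ip", "link", "add", p.HostIf, "type", "veth", "peer", "name", p.NsIf},
		{"ip", "link", "set", p.NsIf, "netns", p.Namespace},
		{"ip", "addr", "add", p.HostCIDR, "dev", p.HostIf},
		inNS("ip", "addr", "add", p.NsCIDR, "dev", p.NsIf),
		{"ip", "link", "set", "dev", p.HostIf, "up"},
		inNS("ip", "link", "set", "dev", p.NsIf, "up"),
	}
}

// render formats each command as a single shell-style line.
func (p vethPlan) render() []string {
	var lines []string
	for _, c := range p.commands() {
		lines = append(lines, strings.Join(c, " "))
	}
	return lines
}

func main() {
	plan := vethPlan{
		Namespace: "example",
		HostIf:    "host-veth",
		NsIf:      "example-veth",
		HostCIDR:  "192.168.26.1/24",
		NsCIDR:    "192.168.26.2/24",
	}
	for _, line := range plan.render() {
		fmt.Println("sudo " + line)
	}
}
```

Teardown needs no matching plan: deleting the namespace removes the veth pair with it.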

Now we have two separate network stacks that can communicate with each other. However, only the host can access the internet. To provide internet access to our new namespace, we need to configure routing and NAT rules.

Enabling Internet Access

First, we need to configure the namespace to route all traffic through the host veth interface:

# Set default route in the namespace to use the host veth interface
sudo ip netns exec example ip route add default via 192.168.26.1

Next, we need to configure the host to forward traffic and perform NAT:

# Enable IP forwarding in the kernel
sudo sysctl -w net.ipv4.ip_forward=1

# Allow established connections from internet back to namespace
sudo iptables -A FORWARD -i enp0s1 -o host-veth -m state --state RELATED,ESTABLISHED -j ACCEPT

# Allow new outgoing connections from namespace to internet
sudo iptables -A FORWARD -i host-veth -o enp0s1 -j ACCEPT

# Masquerade (NAT) traffic from the namespace subnet
sudo iptables -t nat -A POSTROUTING -s 192.168.26.0/24 -o enp0s1 -j MASQUERADE

Note: Replace enp0s1 with your actual physical network interface name (find it with ip route show default).

Now the namespace can reach the internet! Test with sudo ip netns exec example ping 8.8.8.8:

PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=117 time=10.2 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=117 time=9.66 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=117 time=9.31 ms

Adding Bridge and TAP Devices for VMs

Now we have established a separate network stack connected to both the host and internet. This is already powerful, but for my use case, I wanted to run all VMs inside this isolated network namespace to avoid polluting the host networking and enable easy cleanup—simply delete the namespace and all virtual interfaces disappear automatically.

To achieve this, we need to make a few adjustments:

Create a bridge within the namespace
Remove the IP address from the namespace veth interface
Assign the IP address to the bridge instead
Set the bridge as master for the veth interface
Connect all VM TAP devices to this bridge

# Create a bridge in the namespace
sudo ip netns exec example ip link add name br0 type bridge

# Remove IP from veth interface and add it to the bridge
sudo ip netns exec example ip addr del 192.168.26.2/24 dev example-veth
sudo ip netns exec example ip addr add 192.168.26.2/24 dev br0

# Add veth interface to the bridge
sudo ip netns exec example ip link set example-veth master br0

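# Bring the bridge up
sudo ip netns exec example ip link set br0 up

# Re-add the default route: removing the last address from example-veth
# also flushed the routes that went through it
sudo ip netns exec example ip route add default via 192.168.26.1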

Now all VM TAP devices created within this namespace will use the bridge as their master, and all VM networking components live in the dedicated namespace. For implementation details, see this pull request showing how this was integrated into qcontroller.

Bonus: Embedded DHCP Server

This networking redesign was partly motivated by the inconvenience of relying on external DHCP servers. Managing a separate DHCP service—starting it independently and configuring interfaces—initially seemed like it would provide flexibility, but in practice proved cumbersome.

I wanted to integrate a DHCP server directly into qcontroller, but faced a significant obstacle: DHCP servers must bind to port 67. If the host system already has a DHCP service running on this port, you cannot start another one in the same network namespace.

Network namespaces solve this elegantly! Since each namespace has its own isolated network stack, including port space, you can run a DHCP server on port 67 within the namespace without conflicts. This allows qcontroller to provide integrated DHCP services for VM networking while keeping everything cleanly separated from the host system.

Conclusion

Network namespaces provide an elegant solution for isolating VM networking infrastructure. Key benefits include:

Clean separation of VM networking from host networking
Automatic cleanup when deleting the namespace
Port isolation enabling embedded services like DHCP
Complete control over routing, firewall rules, and network topology
Simplified management through namespace-scoped operations

By leveraging network namespaces, we can build more robust and maintainable virtualization solutions that don’t interfere with the host system’s networking configuration.

Running QEMU VMs on ARM64: UEFI Requirements

2025-10-05T00:00:00+00:00

In my previous notes, I’ve discussed how QEMU serves as a versatile and flexible tool for creating and managing virtual machines. One of QEMU’s greatest strengths is its support for a wide range of platforms, making it an ideal choice for cross-platform development and testing. However, this versatility requires us to understand the subtle differences between architectures when configuring our VMs.

In this article, I’ll explain why the QEMU commands that work for x86_64 platforms require specific adjustments when running ARM64 VMs, with a particular focus on the UEFI firmware requirements that are essential for ARM64 virtualization.

Understanding the Difference: ARM64 vs x86_64 Booting

When working with ARM64 architecture, there’s a fundamental difference in how the system boots compared to traditional x86_64 systems. While ARM64 can utilize different boot methods including U-Boot for embedded systems, UEFI (Unified Extensible Firmware Interface) is the default and preferred method for server and cloud environments. As documented in the Ubuntu server virtualization guide, Ubuntu ARM64 cloud images specifically rely on UEFI for hardware initialization and kernel loading.

Unlike x86_64, which can boot using legacy BIOS or UEFI without additional configuration in QEMU, ARM64 cloud images typically require explicitly configured UEFI firmware. When using QEMU for ARM64 virtualization with cloud images like Ubuntu, we must explicitly provide:

UEFI Firmware (.fd) file: These files contain the actual UEFI firmware code, which includes drivers, bootloaders, and the pre-boot environment for the system. Think of this as the replacement for traditional BIOS.
UEFI Variables (.vars) file: These store data in the system’s non-volatile RAM (NVRAM) that control the UEFI environment. This includes critical information such as the default boot entry, boot order, and secure boot settings.

Finding Available Firmware Files

Fortunately, when you install QEMU, it automatically includes supported firmware files for various architectures. To locate the firmware files available in your QEMU installation, run:

qemu-system-aarch64 -L help

This command will display output similar to:

/opt/homebrew/Cellar/qemu/10.1.0/bin/../share/qemu-firmware
/opt/homebrew/Cellar/qemu/10.1.0/bin/../share/qemu

These directories contain both firmware and UEFI variable files for different architectures. For ARM64 (aarch64) with the “virt” machine type, the suitable firmware is typically edk2-aarch64-code.fd.

Properly Configuring ARM64 VMs

To run an ARM64 VM, we need to adjust our QEMU command from what we might use for x86_64. Here’s a proper example for running an Ubuntu ARM64 cloud image:

qemu-system-aarch64 \
  -machine virt -accel hvf -m 2048 \
  -nographic -hda ./ubuntu-25.04-server-cloudimg-arm64.img \
  -smbios type=1,serial=ds='nocloud;s=http://192.168.178.37:8000/' \
  -bios edk2-aarch64-code.fd

Let’s break down the new elements that are specific to ARM64:

-machine virt: We use the “virt” machine type instead of “q35” (which is for x86_64)
-bios: Points QEMU at the UEFI firmware image to load (here edk2-aarch64-code.fd)

The -bios parameter is critical here, as it tells QEMU which UEFI firmware to boot the guest with.

Conclusion

Running ARM64 VMs with QEMU requires understanding the essential role that UEFI plays in the boot process. By correctly specifying the firmware, you can successfully run ARM64 virtual machines even on different host architectures.

Local DNS Resolution for Docker Containers in Development

2025-09-07T00:00:00+00:00

The challenge: service discovery in containers

In modern backend development, most systems run in isolated environments—most commonly, containers. A typical backend consists of several services that need to communicate with each other. Orchestrators like Kubernetes and Docker Compose provide internal DNS so services can reach each other by hostname. It’s convenient and often feels like magic.

Why internal DNS isn’t enough (the “public URL” problem)

What if you need to access your service via a public URL? Imagine a reverse proxy fronting everything, with Keycloak behind it and an oauth2-proxy handling authentication via Keycloak. oauth2-proxy needs:

--redirect-url — the URL the browser hits
--oidc-issuer-url — the URL the proxy uses to obtain tokens

To avoid CSRF issues you also set --cookie-secure=true. Because clients reach Keycloak through the reverse proxy, the redirect URL must point to the proxy, and the issuer URL should use that same public hostname to reach Keycloak. You could use an internal DNS name for the issuer URL, but that breaks the CSRF check: both URLs must share the same hostname, which is typically a public domain you don’t have in local dev. Hence the dilemma.

Why mismatched hostnames trigger “CSRF” errors

During the OAuth/OIDC flow your proxy sets a short-lived value (state/nonce) in a cookie on the exact host the user is visiting (e.g. auth.local.test). When Keycloak redirects back, the proxy must compare the state in the callback URL with the copy stored in that cookie. That comparison is the CSRF defence. If you mix hosts—say the browser hits https://auth.local.test but your issuer is http://keycloak:8080—the browser won’t send the cookie to the other host. Different host => different cookie scope. On top of that, --cookie-secure=true means the cookie is only sent over HTTPS, so any HTTP hop drops it. Modern SameSite rules also treat different hosts as “cross-site”, which further blocks the cookie from riding along. The proxy can’t find the cookie it set, the state check fails, and you get a CSRF error.

This is why resolving the “public” name to your container locally is so effective: every step sees the same host, so the browser sends the right cookie and the CSRF check passes.
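The cookie-scope argument above can be made concrete with a small model. This is a minimal sketch, not how a browser or oauth2-proxy is implemented: the function name cookieReturned is hypothetical, the URLs are illustrative, and the model deliberately covers only host-only scoping and the Secure flag (SameSite and Domain attributes are left out).

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// cookieReturned is a simplified, hypothetical model of browser behaviour:
// a host-only cookie set while visiting setOnURL is sent with a request to
// requestURL only if the hostnames match and, when secure is true, the
// request goes over HTTPS. SameSite and Domain attributes are ignored.
func cookieReturned(setOnURL, requestURL string, secure bool) (bool, error) {
	setOn, err := url.Parse(setOnURL)
	if err != nil {
		return false, err
	}
	req, err := url.Parse(requestURL)
	if err != nil {
		return false, err
	}
	// Cookies are scoped by hostname; the port plays no role.
	if !strings.EqualFold(setOn.Hostname(), req.Hostname()) {
		return false, nil
	}
	// A Secure cookie is never attached to a plain-HTTP request.
	if secure && req.Scheme != "https" {
		return false, nil
	}
	return true, nil
}

func main() {
	// The state cookie is set while the browser is on the public hostname.
	setOn := "https://auth.local.test/oauth2/start"
	callbacks := []string{
		"https://auth.local.test/oauth2/callback", // same host, HTTPS
		"http://keycloak:8080/oauth2/callback",    // different host
		"http://auth.local.test/oauth2/callback",  // same host, plain HTTP
	}
	for _, cb := range callbacks {
		ok, err := cookieReturned(setOn, cb, true)
		fmt.Printf("%-45s cookie sent: %v (err: %v)\n", cb, ok, err)
	}
}
```

Only the first callback gets the cookie back, which is the case where the state check can succeed.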

Existing Solutions

At this point, you either fake domains in /etc/hosts or look for a tool that maps container names to hostnames. I started with the smart devdns project and even tried automating hosts-file updates on Docker start/stop. It worked, but hosts files are brittle and easily clobbered. I wanted something that behaves like real DNS without hand-editing files.

A Better Approach: Local DNS Server for Containers

Run a local DNS server that watches running containers. If a query matches a container’s name (or alias), answer with the container’s IP. Otherwise, forward to your normal upstreams (Google, Cloudflare, etc.). Docker’s APIs are great in Go, and miekg/dns makes DNS straightforward, so I built a tiny server in Go. You can find the code here.

How it works (at a glance)

Browser asks DNS for example.com.
Local resolver checks if a container named/aliased example.com is running.
If yes → return the container’s IP. If no → forward to public DNS and return that IP.

How to Use the Local DNS Server

When you run the DNS server locally (for example, on port 53), it will resolve container names to their IP addresses automatically. Here’s a simple example using Docker Compose:

services:
  ubuntu:
    image: ubuntu:latest
    container_name: github.com
    command: ["sleep", "infinity"]

After starting this Compose file, any DNS query for github.com will resolve to the IP address of the ubuntu container. For instance, running dig github.com will return:

; <<>> DiG 9.20.4-3ubuntu1.2-Ubuntu <<>> github.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44158
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;github.com.                    IN      A

;; ANSWER SECTION:
github.com.             0       IN      A       172.21.0.2

;; Query time: 2 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Sun Sep 07 19:24:24 CEST 2025
;; MSG SIZE  rcvd: 55

Notice that the IP address in the answer section matches the container’s IP. You can verify this with:

docker compose -f /tmp/docker-compose.yml ps -q ubuntu | xargs docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}'
172.21.0.2

Configuring Your System to Use the DNS Server

Note: To use this DNS server, configure your system to point to it. For example, if using systemd-resolved (replace <interface> with the network link you want to configure; resolvectl status lists them):
sudo resolvectl dns <interface> 127.0.0.1:5300
This change is temporary and will reset on reboot. To revert manually:
sudo systemctl restart systemd-resolved

Conclusion

Local dev often breaks when parts of your stack see different hostnames. A tiny local DNS server fixes that: resolve container names to their IPs, forward everything else upstream, and your dev environment starts behaving like production without hacks.

Why this helps you

One hostname end-to-end → fewer auth/cookie surprises.
No manual hosts-file edits.
Works with Compose out of the box; trivial to verify with dig.