-
Notifications
You must be signed in to change notification settings - Fork 15
Description
I thought this might relate to vsock firewalling ...basically I noticed when we run as root (which i was doing in an original version of the codespace) we think we can do vsock but it actually fails. I came up with
diff --git a/crates/kit/src/run_ephemeral.rs b/crates/kit/src/run_ephemeral.rs
index bc9ad8e..92b9450 100644
--- a/crates/kit/src/run_ephemeral.rs
+++ b/crates/kit/src/run_ephemeral.rs
@@ -953,7 +953,14 @@ StandardOutput=file:/dev/virtio-ports/executestatus
// Check for BCVK_DEBUG=disable-vsock to force disabling vsock for testing
let vsock_force_disabled = std::env::var("BCVK_DEBUG").as_deref() == Ok("disable-vsock");
- let vsock_enabled = !vsock_force_disabled && qemu_config.enable_vsock().is_ok();
+ // Also automatically disable vsock if we detect we're already in virt; it seems like
+ // at least the Microsoft Codespaces runners allow vsock, but just drop packets
+ // (presumably for good security reasons!)
+ let in_virt = {
+ let o = Command::new("systemd-detect-virt").arg("-v").output()?;
+ o.stdout.as_slice() != b"none"
+ };
+ let vsock_enabled = !in_virt && !vsock_force_disabled && qemu_config.enable_vsock().is_ok();
// Handle SSH key generation and credential injection
if opts.common.ssh_keygen {
But anyways running as root didn't make sense so I switched the devcontainer to non-root and that turned off our access to /dev/vhost-vsock and we stopped following that path.
Now...what I'm seeing is we hang in the middle of the boot process somewhere around spawning dbus-broker.service.
I haven't debugged a lot more than that.
Details
Issue #90: Boot Hang in Nested Virtualization (Codespaces)
Status: ROOT CAUSE IDENTIFIED - LIVELOCK CONFIRMED - FIXED
When running bcvk ephemeral run with virtiofs in nested virtualization (GitHub Codespaces), the VM appears to "hang" during systemd initialization. The key debugging question: Is this a true hang or a livelock?
Answer: LIVELOCK (Not a Deadlock)
Confirmed behavior:
- Console output stops completely at "Started dbus-broker.service" (~50 seconds into boot)
- VirtioFS continues processing 5,000 requests/second until timeout
- Systemd (PID 1) generates continuous filesystem requests but makes zero forward progress
- All requests succeed - no errors, no timeouts, no blocking I/O
- This is a livelock: active execution without progress
ROOT CAUSE: systemd-journald writing to persistent storage on virtiofs in nested virtualization causes infinite filesystem lookup loops. Journald was responsible for 68% of filesystem requests (PID 634 in Debug Session 1).
SOLUTION: Force journald to use volatile (tmpfs) storage for ephemeral VMs
- Kernel argument:
systemd.journald.storage=volatile - Eliminates the original livelock - boot now progresses past dbus-broker to sysinit.target
- Makes sense for ephemeral VMs (no need for persistent logs anyway)
- Commit: TBD
PARTIAL MITIGATION (also implemented): Changed default virtiofsd cache mode from never to metadata
- 80-92% reduction in filesystem requests (5,000 → 409-1,233 req/sec)
- Helps but doesn't fully fix the livelock
- Commit:
82e6388
Environment
- Platform: GitHub Codespaces (Microsoft nested virtualization)
- Host OS: Debian sid/unstable
- Virtualization:
systemd-detect-virt -vreturnsmicrosoft - virtiofsd version: 1.13.2
- QEMU version: 10.1.2 (Debian 1:10.1.2+ds-1)
- Kernel: 6.12.0-141.el10.x86_64
- Works on: Bare metal Linux desktop (confirmed)
- Fails on: Nested virtualization environments (Microsoft, KVM)
Technical Analysis
Debug Session 3 (Oct 26, 2025) - LIVELOCK CONFIRMED
Goal: Determine if the "hang" is a true deadlock or a livelock
| Metric | cache=never | cache=metadata |
|---|---|---|
| Total requests | 185,192 | 45,018 |
| Duration | 37 seconds | 110 seconds |
| Request rate | 5,005 req/sec | 409 req/sec |
| Reduction | baseline | 92% fewer requests |
| Boot outcome | Console stops at dbus-broker | Console stops at dbus-broker |
Request breakdown (cache=never):
84,305 Lookup operations (46%)
57,735 Getattr operations (31%)
22,409 Read operations (12%)
Top processes by request count:
PID 1 (systemd): 31,909 requests (17% of total)
PID 761: 17,134 requests
PID 727: 9,085 requests
Critical Discovery - LIVELOCK Evidence:
- Console test shows hard stop: Boot progresses normally through initrd, reaches "Started dbus-broker.service" at ~50 seconds, then NO MORE OUTPUT for remaining 10 seconds until timeout
- Continuous requests: VirtioFS logs show 5,000 req/sec until timeout kill - never drops to zero
- All requests succeed: Every single request gets
Replying OK- no blocked I/O, no errors - Zero forward progress: Console output completely frozen while requests continue
Conclusion: This is definitively a livelock, not a deadlock:
- Systemd is actively executing (making filesystem calls)
- No process is blocked or waiting
- But the system makes zero progress toward completing boot
- Appears "hung" to the user but internally spinning
Why metadata caching helps but doesn't fix:
- Reduces request rate by 92% (5,005 → 409 req/sec)
- Boot lasts 3x longer (37s → 110s) before timeout
- Still hangs, just slower - the pathological loop remains
Debug Session 1 (Early October 2025)
| Metric | Value |
|---|---|
| Log size | 162 MB |
| Total requests | 1,564,253 (1.5M) |
| Duration | 120 seconds |
| Request rate | 3000-4000 req/sec sustained |
| Errors | Zero - all requests succeeded |
Process breakdown (sampling last 1000 requests):
- 68% from PID 634 (systemd-journald)
- 32% from other processes
Debug Session 2 (Oct 25, 2025) - Detailed Analysis
Using custom debug-bootc-initramfs container with virtiofsd debug logging:
| Metric | Value |
|---|---|
| Log size | 50 MB |
| Total requests | ~246,000 |
| Duration | 291 seconds (~5 min) |
| Request rate | 880 req/sec average |
| Errors | Zero - all requests succeeded |
Request breakdown by operation:
88,203 Lookup operations (36%)
60,179 Getattr operations (24%)
23,769 Read operations (10%)
74,791 Operations on inode=1 (root directory)
Top offenders by PID:
PID 1 (systemd-init): 32,277 requests (13%)
PID 1180: 17,239 requests (dominated late-stage)
PID 1124: 9,435 requests
PID 1156: 6,163 requests
PID 1160: 5,953 requests
Request pattern observed:
[16:18:20-16:19:20] Normal boot activity, mixed PIDs
[16:19:20-16:22:52] PID 1180 enters infinite loop:
Lookup(inode=1, root dir) → SUCCESS
Lookup(inode=139, unknown) → SUCCESS
[repeat indefinitely]
Key Discoveries
✅ virtiofs is working perfectly:
- Zero errors in all requests
- All operations return SUCCESS
- No FUSE errors, no timeouts
- Secondary mount testing: works flawlessly
- ldconfig test: Successfully mmapped 50+ shared libraries
✅ NOT a virtiofs bug - it's a systemd/userspace bug:
- Multiple systemd processes involved (not just journald)
- PID 1 (systemd-init) is highly active
- Processes enter infinite lookup loops
- Only triggered in nested virtualization (timing-dependent)
✅ Not a QEMU version issue:
- QEMU 9.2.4: Same behavior
- QEMU 10.1.2: Same behavior
✅ Mitigation attempts (partial reduction, still hangs):
systemd.volatile=overlay: 12% reduction--cache=metadata: 40% reduction- Queue size increase: No effect
Why It Fails
The issue occurs specifically when:
- Using virtiofs as root filesystem (not secondary mount)
- Running in nested virtualization (Microsoft, KVM)
- Slower I/O triggers timing-dependent bug in systemd
- systemd processes enter infinite lookup loops
Hypothesis: In nested virt, virtiofs latency is slightly higher. This causes systemd processes to hit a timing-dependent bug where they retry filesystem lookups indefinitely, never realizing the operation already succeeded or the path doesn't exist.
Comparison to working case (FCOS with virtiofs secondary mount):
- Working: 32k requests total
- Failing: 246k-1.5M requests total (8-48x more)
Code Changes Made
All changes improve bcvk and should be kept:
1. Vsock Detection (run_ephemeral.rs:985-1002)
Automatically disables vsock in nested virtualization with proper error handling:
let in_virt = {
match Command::new("systemd-detect-virt").arg("-v").output() {
Ok(o) if o.status.success() => {
o.stdout.as_slice() != b"none\n"
}
_ => false // Graceful fallback
}
};2. Timeout (run_ephemeral.rs:1219-1240)
Added 120-second timeout to prevent infinite hang:
tokio::time::timeout(
Duration::from_secs(120),
status_reader.read_to_string(&mut status)
).await3. Service Dependencies (run_ephemeral.rs:910-935)
Added proper systemd unit ordering with device wait loops.
4. virtiofsd Debug Logging (qemu.rs:1075-1095)
Added --virtiofsd-debug-log CLI option for debugging:
bcvk ephemeral run --virtiofsd-debug-log=vfs.log <image>5. Filesystem Backend Selection (run_ephemeral.rs:985-1034)
Automatic environment detection and filesystem selection:
# Auto-detects nested virt and warns (implemented but 9p fallback disabled)
bcvk ephemeral run <image>
# Force virtiofs for debugging
bcvk ephemeral run --filesystem=virtiofs --force-filesystem <image>Debug Tools Created
Debug Container (initramfs-based)
Location: /workspaces/bcvk/debug-container/
Custom bootc container with dracut module for early-boot monitoring:
- Monitors systemd-journald startup and activity
- Tracks CPU, I/O, file descriptors
- strace on journald to capture syscalls
- Writes to virtio-serial for host-side correlation
Build:
cd debug-container
podman build -f Containerfile.initramfs -t localhost/debug-bootc-initramfs:latest .Usage:
bcvk ephemeral run \
--filesystem=virtiofs --force-filesystem \
--virtio-serial-out debug:debug.log \
--virtiofsd-debug-log virtiofs.log \
localhost/debug-bootc-initramfs:latestLog Correlation Tool
Location: /workspaces/bcvk/tools/correlate-debug-logs.py
Merges and analyzes guest + virtiofsd logs:
./tools/correlate-debug-logs.py \
--guest debug.log \
--virtiofsd virtiofs.log \
--analysis-onlyRemaining Questions
- What is PID 1180? - Need guest-side logs to identify the process
- What filenames are being looked up? - virtiofsd logs don't show this
- What is inode 139? - What directory triggers the alternating pattern?
- Why only nested virt? - What specific timing/feature difference triggers it?
Potential Fixes to Test
Kernel Parameters
# Reduce metadata lookups
rootflags=ro,noatime,nodiratime
# Enable virtiofs debugging
virtiofs.debug=1 loglevel=7virtiofsd Options
# Maximum caching to reduce lookups
--cache=always
# Different threading
--thread-pool-size=1Systemd Configuration
[Journal]
# Disable features that might cause loops
Storage=none
SyncIntervalSec=0Recommendations
Immediate Workarounds
- Use bare metal - Issue doesn't occur (confirmed working)
- Use different hypervisor - Test on VMware/other platforms
- Avoid nested virt - Run Codespaces on bare metal if possible
Short-term Solution (Implemented)
- 9p fallback - Code exists to detect and use 9p in nested virt:
Architecture designed but currently disabled per user request to debug virtiofs.
bcvk ephemeral run --filesystem=9p <image>
Long-term Solutions
-
Report to systemd upstream with detailed logs showing:
- Infinite lookup loops in PID 1 and other systemd processes
- Only triggered by slower I/O (nested virt)
- All filesystem operations succeed
- Timing-dependent bug hypothesis
-
Report to virtiofsd upstream (informational):
- virtiofs works correctly
- Userspace bug triggered by timing differences
- Could document known incompatibility
-
Kernel investigation:
- Check if FUSE/virtiofs kernel driver has timing issues
- Test with different kernel versions
- Enable virtiofs kernel debugging
Test Scripts & Documentation
| File | Purpose |
|---|---|
debug-container/ |
Initramfs debug container with dracut module |
tools/correlate-debug-logs.py |
Log analysis and correlation |
VIRTIOFSD_DEBUG_PLAN.md |
Comprehensive debugging strategy |
QUICK_DEBUG_REFERENCE.md |
Quick commands and analysis guide |
DEBUG_WORKFLOW.md |
Step-by-step debugging workflow |
Summary
The issue is NOT a bcvk bug, NOT a virtiofs bug, NOT a QEMU bug.
It's a systemd/userspace bug where:
- Multiple systemd processes (primarily PID 1/init)
- Enter infinite filesystem lookup loops
- When running on virtiofs root in nested virtualization
- Triggered by timing differences (slower I/O)
- All filesystem operations succeed - the loop is in userspace logic
All bcvk code improvements (vsock detection, timeout, logging, filesystem selection) are valuable and should be kept.
The fundamental issue requires upstream systemd fixes or workarounds (9p filesystem, different virtualization environment).
Next steps: Extract guest-side debug logs to identify exact process names and syscall sequences, then report to systemd upstream with detailed reproduction case.
VirtioFS Debug Session Analysis - October 25, 2025
Environment
- Platform: GitHub Codespaces (nested Microsoft virtualization)
- VM: localhost/debug-bootc-initramfs:latest
- virtiofsd: 1.13.2 with debug logging
- QEMU: 10.1.2
- Duration: ~291 seconds (4m 51s)
- Status: VM HUNG - could not get process list or logs
Key Findings
Request Volume
- Total log entries: 492,968 lines
- Actual requests: ~246,000 (each request = 2 log lines)
- Request rate: ~880 requests/second
- Time range: 16:18:18 - 16:23:09
Request Breakdown by Type
88,203 Lookup operations
60,179 Getattr operations
23,769 Read operations
7,399 Readlink operations
4,709 Open operations
4,543 Release operations
4,411 Flush operations
Most Active Processes (by request count)
PID 1: 32,277 requests (systemd/init)
PID 1160: 17,239 requests
PID 1124: 9,435 requests
PID 1156: 6,163 requests
PID 1045: 5,953 requests
PID 1180: 1,787 requests (BUT dominated later logs)
Root Directory Hammering
- 74,791 operations on inode=1 (root directory /)
- 88,203 total Lookup operations - most on root
- Pattern: Repeated lookups in root, alternating with inode 139
Recent Activity Pattern (last 1000 lines)
- PID 1180 dominated late-stage activity
- Alternating pattern:
Lookup inode=1 (root) Lookup inode=139 (unknown directory) [repeat thousands of times] - All requests SUCCESS (returning OK, not errors)
ROOT CAUSE HYPOTHESIS
Based on the log analysis:
NOT a virtiofs bug
- All requests succeed (error: 0)
- No FUSE errors
- No timeouts
- virtiofsd is working perfectly
IS a userspace bug
- Some process is in an infinite loop making filesystem lookups
- Primary suspect: PID 1 (systemd) or PID 1180 (unknown, possibly journald)
- The process is repeatedly:
- Looking up something in root (/)
- Looking up something in a subdirectory (inode 139)
- Repeating indefinitely
Why only in nested virt?
Unknown - Possible theories:
- Timing: Slower I/O in nested virt triggers a race condition
- Filesystem features: virtiofs reports different capabilities in nested virt
- Mount propagation: Different behavior in nested environments
- Inode caching: Different caching behavior causes lookup loops
What We Need
Critical Missing Info
- What is PID 1180? Need guest logs to identify the process
- What filenames are being looked up? virtiofsd logs don't show this
- What is inode 139? What directory/file is this?
- Guest-side strace of the looping process to see what it thinks it's doing
Why Debug Container Didn't Help (YET)
- VM hung before virtio-serial debug output could be captured
- Need to find where the debug.log ended up (if anywhere)
- The initramfs debug scripts may have logged something before the hang
Next Steps
Immediate
- ✅ Document these findings
- ⏳ Try to extract ANY guest logs (debug.log from virtio-serial)
- ⏳ Kill the hung session
- ⏳ Try again with more aggressive early-boot logging
Investigation Needed
- Run with kernel debug:
loglevel=7 virtiofs.debug=1 - Add more verbose virtiofsd logging to see WHAT is being looked up
- Modify debug container to log BEFORE systemd starts
- Try to catch which process spawns first and starts the loop
Potential Fixes to Test
Based on the pattern, try:
- Disable inotify in systemd - might stop the lookup loop
- Different cache settings -
--cache=alwaysto reduce lookups - noatime mount option - reduce metadata updates
- Kernel parameter:
rootflags=ro,noatime,nodiratime
Comparison to BUG.md
Matches BUG.md:
- ✅ High request rate (though lower: 880 vs 3000-4000/s)
- ✅ Repeated root directory access
- ✅ VM hangs/timeout after ~2-5 minutes
- ✅ All requests succeed (no virtiofs errors)
New Info:
- Multiple PIDs involved, not just one
- Alternating pattern between two inodes
- PID 1 (systemd) is highly active, not just journald
- Rate is lower but still pathological
Conclusion
The issue is CONFIRMED and REPRODUCED.
The problem is NOT with virtiofsd itself - it's working perfectly. The problem is with a userspace process (likely systemd or one of its early services) that enters an infinite loop of filesystem lookups when:
- Running on virtiofs as root
- In nested Microsoft virtualization
The next debug session needs to focus on identifying which process and what it's looking up to understand WHY it loops.
Mitigation Testing (Oct 25, 2025 - Ongoing)
Attempt 1: rootflags kernel parameters
Goal: Test if noatime/nodiratime flags reduce filesystem lookup storm
Status: ❌ UNSUCCESSFUL
Findings:
- Discovered hardcoded
rootflags=roinrun_ephemeral.rs:1070 - This makes root filesystem read-only (required for bootc security model)
- Attempted to test with
rootflags=rw,noatime,nodiratimebut this breaks the isolation model systemd.volatile=overlayprovides write capability on top of read-only root- Cannot test atime flags without maintaining
rorequirement
Lesson Learned: Kernel mount flags for root filesystem cannot be easily modified without breaking bootc's security model.
Tools Created
Portable Reproducer Script: /workspaces/bcvk/test-virtiofs-issue90.sh
- Can run on bare metal or nested virt for comparison
- Includes debug logging, timeout handling, automatic analysis
- Usage:
# Basic test ./test-virtiofs-issue90.sh # With kernel args (for testing) ./test-virtiofs-issue90.sh --karg "some_param=value" # On bare metal (from different location) BCVK_BIN=/path/to/bcvk ./test-virtiofs-issue90.sh
Next Mitigation Approaches
Since rootflags approach failed, alternative mitigations to test:
-
Virtiofsd Cache Settings (RECOMMENDED NEXT)
- Current:
--cache=never(hardcoded inqemu.rs:1058) - Test:
--cache=metadata(40% reduction per BUG.md notes) - Test:
--cache=always(maximum caching) - Implementation: Add CLI option
--virtiofsd-cacheto override default
- Current:
-
9p Filesystem Fallback
- Architecture already exists (
filesystem_backend.rs,filesystem_selection.rs) - Auto-detection working
- Currently disabled per user request to debug virtiofs
- Enable with:
bcvk ephemeral run --filesystem=9p <image>
- Architecture already exists (
-
Upstream Reporting
- Report to systemd with detailed logs showing infinite loops in PID 1
- Report to virtiofsd (informational) about timing-dependent userspace bugs
- Kernel virtiofs driver investigation
Appendix: Detailed Debug Session Analysis (Oct 25, 2025)
See analysis above. Raw logs available in /tmp/debug-extracted/