guix-daemon socket activation does not work on the hurd

Status: Done
Participants: Ludovic Courtès, yelninei
Owner: unassigned
Submitted by: yelninei
Severity: normal

yelninei wrote 11 months ago
(address . bug-guix@gnu.org)
ONG4TKn--F-9@tutamail.com
Hi,

Today I reconfigured my system, and after a reboot I am unable to use guix-daemon on a childhurd.


guix build hello -n
guix build: error: failed to connect to `/var/guix/daemon-socket/socket': Protocol error

Offloading:
guix offload: error: failed to connect over SSH to daemon at 'localhost', socket /var/guix/daemon-socket/socket

Daemon Logs:
socket-activated with 1 socket
unexpected build daemon error: reading from file: Resource temporarily unavailable

Starting the daemon normally as the root user continues to work as before, so I suspect the socket activation change is to blame.
Guix commit: 6af680670bf9055b90e6f8b63c4c2ab7b08e7c56
yelninei wrote 11 months ago
(address . 77610@debbugs.gnu.org)
ONLquG3--F-9@tutamail.com
After mentioning this on IRC Ludovic pushed 8d31cafbdcb818160852a5d1e6fc24c1a9c53e41 to the shepherd repo.

I wanted to try this out and reconfigured using the shepherd from this commit as pid1 in the vm (a bit tricky because of help2man).

The first connection still fails in the same way:
unexpected build daemon error: reading from file: Resource temporarily unavailable

A client mentions:
guix build: error: corrupt input while restoring archive from #<closed: file 2396ea8>

However subsequent connections work.
Ludovic Courtès wrote 11 months ago
Re: bug#77610: guix-daemon socket activation does not work on the hurd
(address . 77610@debbugs.gnu.org)(address . yelninei@tutamail.com)
87a58he7cg.fsf@gnu.org
yelninei--- via Bug reports for GNU Guix <bug-guix@gnu.org> writes:

> After mentioning this on IRC Ludovic pushed 8d31cafbdcb818160852a5d1e6fc24c1a9c53e41 to the shepherd repo.
>
> I wanted to try this out and reconfigured using the shepherd from this commit as pid1 in the vm (a bit tricky because of help2man).
>
> The first connection still fails in the same way:
> unexpected build daemon error: reading from file: Resource temporarily unavailable

I looked a bit into this, and I think shepherd is working as
expected, making the socket blocking before executing
guix-daemon (it’s clear when stracing it on Linux).

So there must be something specific at play on the Hurd.

I tried this snippet (server on one side, client on the other side) and
it works as expected: ‘accept’ blocks and subsequent read does not get
EAGAIN.

So I’m at a loss here. Does ‘tests/systemd.sh’ succeed when run natively?
(In particular the check added in
8d31cafbdcb818160852a5d1e6fc24c1a9c53e41.)

Thanks,
Ludo’.
(use-modules (ice-9 match))

(define (blocking-port port)
  "Return PORT after putting it in blocking mode."
  (let ((flags (fcntl port F_GETFL)))
    (fcntl port F_SETFL (logand (lognot O_NONBLOCK) flags))
    port))

;; Server:
(let ((sock (socket AF_UNIX (logior SOCK_STREAM SOCK_NONBLOCK) 0)))
  (bind sock AF_UNIX "/tmp/sock")
  (listen sock 10)
  (match (pk 'x (accept (blocking-port sock) SOCK_CLOEXEC)) ;should block
    ((port . _)
     (pk 'read (read port)))))

;; Client:
(let ((sock (socket AF_UNIX (logior SOCK_STREAM) 0)))
  (connect sock AF_UNIX "/tmp/sock")
  (display "hi!\n" sock))
yelninei wrote 11 months ago
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 77610@debbugs.gnu.org)
ONuQa_2--R-9@tutamail.com
Hello,

Apr 15, 2025, 16:08 by ludo@gnu.org:

> yelninei--- via Bug reports for GNU Guix <bug-guix@gnu.org> writes:
>
>> After mentioning this on IRC Ludovic pushed 8d31cafbdcb818160852a5d1e6fc24c1a9c53e41 to the shepherd repo.
>>
>> I wanted to try this out and reconfigured using the shepherd from this commit as pid1 in the vm (a bit tricky because of help2man).
>>
>> The first connection still fails in the same way:
>> unexpected build daemon error: reading from file: Resource temporarily unavailable
>>
>
> I looked a bit into this, and I think shepherd is doing the right
> working as expected, making the socket blocking before executing
> guix-daemon (it’s clear when stracing it on Linux).
>
> So there must be something specific at play on the Hurd.
>
> I tried this snippet (server on one side, client on the other side) and
> it works as expected: ‘accept’ blocks and subsequent read does not get
> EAGAIN.
>
> So I’m at loss here. Does ‘tests/systemd.sh’ succeed when ran natively?
> (In particular the check added in
> 8d31cafbdcb818160852a5d1e6fc24c1a9c53e41.)
>

Yes, it is passing both on 1.0.3 and 1.0.4. The only thing failing now is the system-log test.
As before, with #:lazy-start? #f it works as expected, so the only difference is the timing of the first connection. What would the most minimal guix-daemon client need to look like to trigger the EAGAIN?

I tried to verify that the port is definitely blocking before being passed to guix-daemon, and it is. I am very confused.

Do you know of other processes (with not a lot of dependencies) that can be socket activated to try to replicate this with something less complicated than guix-daemon?



> Thanks,
> Ludo’.
>
> (use-modules (ice-9 match))
>
> (define (blocking-port port)
> "Return PORT after putting it in non-blocking mode."
> (let ((flags (fcntl port F_GETFL)))
> (fcntl port F_SETFL (logand (lognot O_NONBLOCK) flags))
> port))
>
> (let ((sock (socket AF_UNIX (logior SOCK_STREAM SOCK_NONBLOCK) 0)))
> (bind sock AF_UNIX "/tmp/sock")
> (listen sock 10)
> (match (pk 'x (accept (blocking-port sock) SOCK_CLOEXEC)) ;should block
> ((port . _)
> (pk 'read (read port)))))
>
> ;; Client:
> (let ((sock (socket AF_UNIX (logior SOCK_STREAM) 0)))
> (connect sock AF_UNIX "/tmp/sock")
> (display "hi!\n" sock))
>
Ludovic Courtès wrote 11 months ago
(address . yelninei@tutamail.com)(address . 77610@debbugs.gnu.org)
87h62n7tbu.fsf@gnu.org
Hi,

yelninei@tutamail.com writes:

>> So I’m at loss here. Does ‘tests/systemd.sh’ succeed when ran natively?
>> (In particular the check added in
>> 8d31cafbdcb818160852a5d1e6fc24c1a9c53e41.)
>>
>
> Yes, it is passing both on 1.0.3 and 1.0.4. The only thing failing now is the system-log test.

Intriguing.

> As before when using #:lazy-start #f it works as expected which makes
> the only difference the timing of the first connection. What would the
> most minimal guix-daemon client need to look like to trigger the
> EAGAIN
>  
> I tried to verify that the port is definitly blocking before being passed to guix-daemon and it is. I am very confused.
>
> Do you know of other processes (with not a lot of dependencies) that can be socket activated to try to replicate this with something less complicated than guix-daemon?

Well there’s ‘guix publish’, and otherwise the examples from
‘tests/systemd.sh’ (following ‘define %command’).

Otherwise we could mimic it by writing a C program that opens a
SOCK_NONBLOCK socket, binds + listens + select(2) until something
happens, then calls fcntl(2) to clear the O_NONBLOCK flag, and then
forks + execs and calls accept(2) in the child process.

Ludo’.
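
The C reproducer described above might look roughly like the sketch below. It is an illustration under assumptions, not code from shepherd or from this thread: the function names are made up, the client is forked from the same program for convenience, and the exec step is left out so that the forked child calls accept(2) directly.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return 1 if FD has O_NONBLOCK set, 0 if it is blocking, -1 on error.  */
int fd_is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    return flags < 0 ? -1 : (flags & O_NONBLOCK) != 0;
}

/* Hypothetical helper: mimic lazy socket activation on the Unix socket at
   PATH.  Return 0 if the socket returned by accept(2) is blocking, 1 if it
   is O_NONBLOCK, and -1 on error.  */
int accepted_socket_is_nonblocking(const char *path)
{
    struct sockaddr_un addr;
    assert(strlen(path) < sizeof addr.sun_path);
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof addr.sun_path - 1);
    unlink(path);                       /* remove a stale socket, if any */

    /* Supervisor side: a non-blocking listening socket.  */
    int listener = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (listener < 0
        || bind(listener, (struct sockaddr *) &addr, sizeof addr) < 0
        || listen(listener, 10) < 0)
        return -1;

    /* Client: connects while the listener is still O_NONBLOCK.  */
    pid_t client = fork();
    if (client < 0)
        return -1;
    if (client == 0) {
        int sock = socket(AF_UNIX, SOCK_STREAM, 0);
        if (sock < 0
            || connect(sock, (struct sockaddr *) &addr, sizeof addr) < 0
            || write(sock, "hi!\n", 4) != 4)
            _exit(1);
        _exit(0);
    }

    /* Wait until the first connection is pending.  */
    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(listener, &readable);
    if (select(listener + 1, &readable, NULL, NULL, NULL) < 0)
        return -1;

    /* Clear O_NONBLOCK before handing the socket to the "daemon".  */
    int flags = fcntl(listener, F_GETFL);
    if (flags < 0 || fcntl(listener, F_SETFL, flags & ~O_NONBLOCK) < 0)
        return -1;

    /* Daemon side: a forked child accepts the pending connection and
       reports the accepted socket's mode through its exit status.  */
    pid_t daemon_pid = fork();
    if (daemon_pid < 0)
        return -1;
    if (daemon_pid == 0) {
        int conn = accept(listener, NULL, NULL);
        int mode = conn < 0 ? -1 : fd_is_nonblocking(conn);
        _exit(mode < 0 ? 2 : mode);
    }

    int status = 0;
    waitpid(client, NULL, 0);
    if (waitpid(daemon_pid, &status, 0) < 0)
        return -1;
    close(listener);
    unlink(path);
    if (!WIFEXITED(status) || WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status);
}
```

Calling accepted_socket_is_nonblocking("/tmp/sock") should return 0 on Linux, where accept(2) does not inherit file status flags from the listening socket. A return value of 1 would indicate that the accepted socket inherited O_NONBLOCK.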
yelninei wrote 11 months ago
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 77610@debbugs.gnu.org)
OO6zDvS--F-9@tutamail.com
Hello,

Apr 16, 2025, 20:19 by ludo@gnu.org:

> Well there’s ‘guix publish’, and otherwise the examples from
> ‘tests/systemd.sh’ (following ‘define %command’).
>
> Otherwise we could mimic it by writing a C program that that opens a
> SOCK_NONBLOCK socket, binds + listens + select(2) until something
> happens, then calls fcntl(2) to clear the O_NONBLOCK flag, and then
> forks + execs and call accept(2) in the child process.
>
> Ludo’.
>
I tested guix-publish and that had no issues.

Some checks I did yesterday with guix-daemon:
- Shepherd is passing a blocking socket.
- The "fdSocket" in "acceptConnection" is always blocking.
- The "remote" socket in "acceptConnection" is O_NONBLOCK on the first connection only.
- The "from.fd" socket in "processConnection" is then also O_NONBLOCK on the first connection.

This then causes EAGAIN when trying to read the clientVersion.

On Linux none of this is an issue.
Adding the same O_NONBLOCK check that exists for the fd 3 socket to the "connection" socket after accept in tests/systemd.sh passes on Linux but causes a failure on the Hurd.

Is glibc accept doing something weird?
I am struggling to understand how the first connection would be any different from subsequent ones (and only in the #:lazy-start? #t case).

I am unsure what to do about this because shepherd seems to do everything correctly. I saw that ci.g.g.o has started to build i586-gnu substitutes (in particular gcc-final), but if you are restarting the builders more aggressively now, then each first build will fail because of this, and I don't know if Cuirass can reschedule builds on such failures.

Maybe the easiest is to expose the #:lazy-start? option for now and disable it for guix-daemon in %base-services/hurd?
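
To illustrate why an unexpectedly non-blocking connection socket produces the reported error: a read on an O_NONBLOCK socket with no data pending fails immediately with EAGAIN, which glibc describes as "Resource temporarily unavailable", the text seen in the daemon log. The sketch below uses socketpair(2) instead of a real daemon connection, and the function name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read from an empty, connected Unix socket that has
   O_NONBLOCK set.  Return the errno of the failed read, 0 if the read
   unexpectedly succeeded, or -1 if the setup itself failed.  */
int errno_from_nonblocking_read(void)
{
    int pair[2];
    char buf[16];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, pair) < 0)
        return -1;

    /* Put one end in non-blocking mode, like the accepted socket in the
       failing first connection.  */
    int flags = fcntl(pair[0], F_GETFL);
    if (flags < 0 || fcntl(pair[0], F_SETFL, flags | O_NONBLOCK) < 0) {
        close(pair[0]);
        close(pair[1]);
        return -1;
    }

    /* Nothing has been written to pair[1] yet, so there is nothing to read.
       A blocking socket would wait here; a non-blocking one fails at once.  */
    errno = 0;
    ssize_t n = read(pair[0], buf, sizeof buf);
    int err = n < 0 ? errno : 0;

    close(pair[0]);
    close(pair[1]);
    return err;
}
```

A daemon that assumes blocking I/O and reads the client's first message before the client has sent it would hit this same error path, which matches the timing dependence described above.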
Ludovic Courtès wrote 11 months ago
(address . yelninei@tutamail.com)(address . 77610@debbugs.gnu.org)
87v7r1yfeu.fsf@gnu.org
Hi,

yelninei@tutamail.com writes:

> I tested guix-publish and that had no issues.

You mean the first ‘wget -O …’ passes?

> Some checks I did yesterday with guix-dameon:
> - Shepherd is passing a blocking socket
> - The "fdSocket" in "acceptConnection" is always blocking.
> - the "remote" socket in "acceptConnection" is O_NONBLOCK on the first connection only.

Looking at ‘accept4.c’ in libc, the only way ‘remote’ can be O_NONBLOCK
is if:

1. ‘accept4’ is passed SOCK_NONBLOCK, but that’s not the case here
(see ‘accept.c’);

2. ‘__socket_accept’ returns an O_NONBLOCK socket, which would be a bug
in the server, pflocal.

At first sight ‘S_io_set_all_openmodes’ in pflocal does the job and
‘S_socket_accept’ honors those flags.

> Adding the same check as for the fd 3 socket  for O_NONBLOCK to the
> "connection" socket after accept  to tests/systemd.sh passes on Linux
> but causes a failure on the Hurd.

So we have a reproducer.

Could you pass it on to bug-hurd? :-) It may be easier if the whole
thing is in C.

> I am unsure what to do about this because shepherd seems to do
> everything correctly. I saw that ci.g.g.o has started to build
> i586-gnu substitutes (in particular gcc-final) but if you are
> restarting the builders more aggressively now then each first build
> will fail because of this and idk if cuirass can reschedule builds on
> such failures.

Yeah, it’s not great. Those will have to be restarted manually I’m
afraid, but most of the time anybody can click on the “Restart” button
in Cuirass.

> Maybe the easiest is to to expose the #:lazy-start? option for now and disable it for guix-daemon in %base-services/hurd ?

Hmm maybe. Let’s first figure out if this is a Hurd bug.

Thanks for investigating!

Ludo’.
Closed
yelninei wrote 10 months ago
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 77610@debbugs.gnu.org)
OQFFJfe--F-9@tutamail.com
Hi Ludo,

Thank you again for finding the cause. Could we add your patch to our hurd, either for master or core-packages-team, as it will be a while until it is available in a tagged snapshot? It would fix the hurd CI builders randomly failing, the childhurd system test, and the minor annoyance that the manual offload is failing.

From what I can see, only adding it to hurd (and not the headers) should not cause a rebootstrap.

May 14, 2025, 17:03 by ludo@gnu.org:

> For the record, this issue is now fixed upstream:
>
> https://git.savannah.gnu.org/cgit/hurd/hurd.git/commit/?id=029ab7d7b38c76ba14c24fcbf526ccef29af9e88
> https://lists.gnu.org/archive/html/bug-hurd/2025-05/msg00016.html
>
> Closing!
>
> Ludo’.
>
Ludovic Courtès wrote 10 months ago
(address . yelninei@tutamail.com)(address . 77610@debbugs.gnu.org)
87v7q2yibm.fsf@gnu.org
Hi yelninei,

yelninei@tutamail.com writes:

> Thank you again for finding the cause. Could we add your patch to our
> hurd either for master or core-packages-team as it will be a while
> until it is available in a tagged snapshot. It would fix the hurd ci
> builders randomly failing, the childhurd system test and the minor
> annoyance that the manual offload is failing.
>
> From what I can see only adding it to hurd (and not the headers) should not cause a rebootstrap. 

Yes, sounds like a good idea. Do you want to give it a try?

Thanks,
Ludo’.
yelninei wrote 10 months ago
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 77610@debbugs.gnu.org)(name . Janneke Nieuwenhuizen)(address . janneke@gnu.org)
OQI0C9G--F-9@tutamail.com
Hello Ludo,

Something like this? I called the patch hurd-socket-activation.patch to indicate what it is addressing. Do you have a better suggestion?

I added it to master but this will create a minor merge conflict with the hurd update on core-packages-team.

May 14, 2025, 21:51 by ludo@gnu.org:

> Hi yelninei,
>
> yelninei@tutamail.com writes:
>
>> Thank you again for finding the cause. Could we add your patch to our
>> hurd either for master or core-packages-team as it will be a while
>> until it is available in a tagged snapshot. It would fix the hurd ci
>> builders randomly failing, the childhurd system test and the minor
>> annoyance that the manual offload is failing.
>>
>> From what I can see only adding it to hurd (and not the headers) should not cause a rebootstrap. 
>>
>
> Yes, sounds like a good idea. Do you want to give it a try?
>
> Thanks,
> Ludo’.
>
From 9119ca37613df139db80e36b821a54c137a56037 Mon Sep 17 00:00:00 2001
Message-ID: <9119ca37613df139db80e36b821a54c137a56037.1747296042.git.yelninei@tutamail.com>
From: Yelninei <yelninei@tutamail.com>
Date: Thu, 15 May 2025 07:51:43 +0000
Subject: [PATCH] gnu: hurd: Fix service socket activation.


* gnu/packages/patches/hurd-socket-activation.patch: New patch
* gnu/packages/hurd.scm (hurd): Add it.
* gnu/local.mk: Register it.

Change-Id: Iff7f30099ffeb014aaacdc3a19bd7930795904b6
---
gnu/local.mk | 1 +
gnu/packages/hurd.scm | 1 +
.../patches/hurd-socket-activation.patch | 44 +++++++++++++++++++
3 files changed, 46 insertions(+)
create mode 100644 gnu/packages/patches/hurd-socket-activation.patch

diff --git a/gnu/local.mk b/gnu/local.mk
index dfafe8b8953..5dc3be1927f 100644
--- a/gnu/local.mk
+++ b/gnu/local.mk
@@ -1591,6 +1591,7 @@ dist_patch_DATA = \
%D%/packages/patches/hurd-64bit.patch \
%D%/packages/patches/hurd-refcounts-assert.patch \
%D%/packages/patches/hurd-rumpdisk-no-hd.patch \
+ %D%/packages/patches/hurd-socket-activation.patch \
%D%/packages/patches/hurd-startup.patch \
%D%/packages/patches/hwloc-1-test-btrfs.patch \
%D%/packages/patches/i7z-gcc-10.patch \
diff --git a/gnu/packages/hurd.scm b/gnu/packages/hurd.scm
index 3b02ed00d1a..443001fbb7b 100644
--- a/gnu/packages/hurd.scm
+++ b/gnu/packages/hurd.scm
@@ -319,6 +319,7 @@ (define-public hurd
(patches (search-patches "hurd-refcounts-assert.patch"
"hurd-rumpdisk-no-hd.patch"
"hurd-startup.patch"
+ "hurd-socket-activation.patch"
"hurd-64bit.patch"))))
(version (package-version hurd-headers))
(arguments
diff --git a/gnu/packages/patches/hurd-socket-activation.patch b/gnu/packages/patches/hurd-socket-activation.patch
new file mode 100644
index 00000000000..e204a90d3aa
--- /dev/null
+++ b/gnu/packages/patches/hurd-socket-activation.patch
@@ -0,0 +1,44 @@
+From 029ab7d7b38c76ba14c24fcbf526ccef29af9e88 Mon Sep 17 00:00:00 2001
+From: =?UTF-8?q?Ludovic=20Court=C3=A8s?= <ludo@gnu.org>
+Date: Thu, 8 May 2025 23:11:36 +0200
+Subject: pflocal: Do not inherit PFLOCAL_SOCK_NONBLOCK across connect/accept.
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+
+Previously, ‘accept’ would return an O_NONBLOCK socket if the listening
+socket was O_NONBLOCK at the time the connection was made. With this
+change, ‘accept’ always returns a socket where O_NONBLOCK is cleared.
+---
+ pflocal/sock.c | 9 ++++++---
+ 1 file changed, 6 insertions(+), 3 deletions(-)
+
+diff --git a/pflocal/sock.c b/pflocal/sock.c
+index 90c618e..6bc061d 100644
+--- a/pflocal/sock.c
++++ b/pflocal/sock.c
+@@ -1,6 +1,6 @@
+ /* Sock functions
+
+- Copyright (C) 1995,96,2000,01,02, 2005 Free Software Foundation, Inc.
++ Copyright (C) 1995,96,2000,01,02, 2005, 2025 Free Software Foundation, Inc.
+ Written by Miles Bader <miles@gnu.org>
+
+ This program is free software; you can redistribute it and/or
+@@ -167,8 +167,11 @@ sock_clone (struct sock *template, struct sock **sock)
+ if (err)
+ return err;
+
+- /* Copy some properties from TEMPLATE. */
+- (*sock)->flags = template->flags & ~PFLOCAL_SOCK_CONNECTED;
++ /* Copy some properties from TEMPLATE. Clear O_NONBLOCK because the socket
++ returned by 'accept' must not inherit O_NONBLOCK from the parent
++ socket. */
++ (*sock)->flags =
++ template->flags & ~(PFLOCAL_SOCK_CONNECTED | PFLOCAL_SOCK_NONBLOCK);
+
+ return 0;
+ }
+--
+cgit v1.1
+

base-commit: 7b73f02c38d568147f1b6a7ff4467f73a212cd1e
--
2.49.0
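
The one-line change to sock_clone in the patch above comes down to which bits are masked out when flags are copied from the template socket. The sketch below isolates that mask. The bit values and the third flag are placeholders picked for illustration, not the real pflocal definitions, and the function names are hypothetical.

```c
#include <assert.h>

/* Placeholder bit values for illustration only; the real definitions live
   in pflocal's headers and may differ.  */
#define PFLOCAL_SOCK_CONNECTED 0x1u
#define PFLOCAL_SOCK_NONBLOCK  0x2u
#define EXAMPLE_OTHER_FLAG     0x4u   /* hypothetical unrelated flag */

/* Flags given to a cloned socket before the fix: only CONNECTED is
   cleared, so NONBLOCK leaks from the listening socket to the new one.  */
unsigned int clone_flags_before(unsigned int template_flags)
{
    return template_flags & ~PFLOCAL_SOCK_CONNECTED;
}

/* Flags given to a cloned socket after the fix: both CONNECTED and
   NONBLOCK are cleared, and every other bit is still copied.  */
unsigned int clone_flags_after(unsigned int template_flags)
{
    return template_flags
        & ~(PFLOCAL_SOCK_CONNECTED | PFLOCAL_SOCK_NONBLOCK);
}
```

With a template that has all three bits set, clone_flags_before keeps the NONBLOCK bit while clone_flags_after keeps only the unrelated flag. That is the difference between the socket that made the first connection fail and the one that accept returns after the fix.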
Ludovic Courtès wrote 10 months ago
(address . yelninei@tutamail.com)(address . 77610@debbugs.gnu.org)(name . Janneke Nieuwenhuizen)(address . janneke@gnu.org)
875xhxlk9r.fsf@gnu.org
Hello,

yelninei@tutamail.com writes:

> Something like this? I called the patch hurd-socket-activation.patch
> to indicate what it is addressing. Do you have a better suggestion?

Perfect; applied, thank you.

> I added it to master but this will create a minor merge conflict with the hurd update on core-packages-team.

Hopefully we can easily address it.

Ludo’.

This issue is archived.
