The long road to getrandom() in glibc
The GNU C library (glibc) 2.25 release is expected to be available at the beginning of February; among the new features in this release will be a wrapper for the Linux getrandom() system call. One might well wonder why getrandom() is only appearing in this release, given that kernel support arrived with the 3.17 release in 2014 and that the glibc project is supposed to be more receptive to new features these days. A look at the history of this particular change highlights some of the reasons why getting new features into glibc is still hard.
Glibc remains a conservative project. There are a number of good reasons
for that, but it
does mean that developers proposing new features tend to run into
roadblocks; that has certainly happened with getrandom(). The
kernel's random
number subsystem maintainer, Ted Ts'o, has been known to complain about the
delay in support for this system call; he has suggested that "maybe the kernel
developers should support a libinux.a library that would allow us to bypass
glibc when they are being non-helpful". Peter Gutmann resorted to channeling Sir Humphrey
Appleby when
describing the glibc project's approach to getrandom(). But what
really caused the delay here?
Glibc bug 17252, requesting the addition of getrandom(), was filed in August 2014, five days after the 3.17 kernel release. Glibc developer Joseph Myers responded twice in the following six months, suggesting that, if anybody wanted getrandom() in glibc, they would need to go onto the project's mailing list and work to drive the development forward. The first reason for the delay is thus simple: nobody stepped up to do the work.
One might wonder why it took so long for somebody to come along and implement a simple system-call wrapper. In its essence, the code that will appear in the 2.25 release is:
    /* Write up to LENGTH bytes of randomness starting at BUFFER.
       Return the number of bytes written, or -1 on error.  */
    ssize_t
    getrandom (void *buffer, size_t length, unsigned int flags)
    {
      return SYSCALL_CANCEL (getrandom, buffer, length, flags);
    }
Such a function does not seem particularly hard to write. The original patch for getrandom() support, finally posted by Florian Weimer in June 2016, was rather more complicated than that, though. Weimer, knowing that the glibc project is conservative and wants the library to work in almost all situations, attempted to cover every base he could think of. So the patch included documentation updates, test programs, and several other details that, in turn, led to a number of sticking points that surely slowed the eventual acceptance of the patch.
The first obstacle, though, had little to do with the patch itself; it was, instead, brought about by the project's reluctance to add wrappers for Linux-specific system calls at all. Glibc does not see itself as a Linux-specific project, so it naturally prefers standardized interfaces that can be supported on all systems. The project has sporadically discussed its policy around Linux-specific calls over the last couple of years. In 2015, Myers described it as:
A draft
policy for Linux-specific wrappers has existed since about then but,
lacking consensus in a strongly consensus-oriented project, it has never
achieved any sort of official status. Thus, even though this policy states
that system-call wrappers should be added by default in the absence of
reasons to the contrary, Roland McGrath responded to the initial patch posting with a
terse message saying: "You need to start with rationale justifying the
new nonstandard API and why it belongs in libc". That justification
was not hard, given that a number of projects have been asking for this
wrapper, and that adding the BSD getentropy() interface on top of
it is easily done, but this challenge foreshadowed much of what was to
come.
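To give a sense of how little is involved in layering getentropy() on top of getrandom(), here is a minimal sketch. It is not the code that was merged into glibc; the name my_getentropy() is hypothetical (chosen to avoid clashing with the library's own symbol), and the sketch assumes a glibc new enough to provide <sys/random.h>. The 256-byte limit and the EIO error follow the BSD interface.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>   /* getrandom(); assumes glibc 2.25 or later */
#include <sys/types.h>

/* Hypothetical name, not the glibc implementation.  Fill BUFFER with
   exactly LENGTH random bytes.  Return 0 on success, or -1 with errno
   set on failure.  */
int
my_getentropy (void *buffer, size_t length)
{
  /* The BSD interface refuses requests larger than 256 bytes.  */
  if (length > 256)
    {
      errno = EIO;
      return -1;
    }

  unsigned char *p = buffer;
  while (length > 0)
    {
      ssize_t n = getrandom (p, length, 0);
      if (n < 0)
        {
          if (errno == EINTR)
            continue;           /* Interrupted by a signal: retry.  */
          return -1;            /* errno was set by getrandom().  */
        }
      if (n == 0)
        {
          errno = EIO;          /* Should not happen; avoid looping.  */
          return -1;
        }
      p += n;
      length -= (size_t) n;
    }
  return 0;
}
```

Unlike getrandom(), this interface either fills the whole buffer or fails, so callers do not have to deal with short reads.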
A trickier question was: what should glibc do when running on pre-3.17 kernels (or non-Linux kernels) that lack getrandom() support? The initial patch included a set of emulation functions so that getrandom() calls would always work; they would read the data from /dev/random or /dev/urandom as appropriate. Doing so involved keeping open file descriptors to those devices (lest later calls fail if the application does a chroot()). But using file descriptors in libraries is always fraught with perils; applications may have their own ideas of which descriptors are available, or may simply run a loop closing all descriptors. So the code took pains to use high-numbered descriptors that applications presumably don't care about, and it used fstat() to ensure that the application had not closed and reopened its descriptors between calls.
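The descriptor juggling described above might look roughly like the following. This is only an illustration of the general technique, not the code from the patch: the function name urandom_fd(), the HIGH_FD_FLOOR threshold of 512, and the choice to compare device and inode numbers are all assumptions made for the sake of the example.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* All names below are hypothetical.  */
#define HIGH_FD_FLOOR 512       /* arbitrary "high" descriptor number */

static int cached_fd = -1;
static dev_t cached_dev;
static ino_t cached_ino;

/* Return a descriptor open on /dev/urandom, or -1 on failure.  */
int
urandom_fd (void)
{
  struct stat st;

  /* Reuse the cached descriptor only if fstat() shows that it still
     refers to the same character device we opened earlier.  */
  if (cached_fd >= 0
      && fstat (cached_fd, &st) == 0
      && S_ISCHR (st.st_mode)
      && st.st_dev == cached_dev
      && st.st_ino == cached_ino)
    return cached_fd;

  /* Otherwise the application closed (and perhaps reused) the number.
     It is no longer ours, so do not close it; just open a new one.  */
  int fd = open ("/dev/urandom", O_RDONLY | O_CLOEXEC);
  if (fd < 0)
    return -1;

  /* Try to move to a high-numbered descriptor; keep the low one if
     that fails (for example, because of a small RLIMIT_NOFILE).  */
  int high = fcntl (fd, F_DUPFD_CLOEXEC, HIGH_FD_FLOOR);
  if (high >= 0)
    {
      close (fd);
      fd = high;
    }

  if (fstat (fd, &st) != 0 || !S_ISCHR (st.st_mode))
    {
      close (fd);
      return -1;
    }

  cached_fd = fd;
  cached_dev = st.st_dev;
  cached_ino = st.st_ino;
  return fd;
}
```

Even this small sketch shows why the approach is fragile: the library has to second-guess the application on every call, and it still has no defense against a program that closes every descriptor in a loop.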
This usage of file descriptors drew a number of comments; it is something that glibc tries to avoid whenever possible. After some discussion, it was concluded that glibc should provide only a wrapper for the system call, without emulation. If an application calls getrandom() on a kernel where that system call is not supported, the glibc wrapper will simply fail, returning -1 with errno set to ENOSYS, and it will be up to the application to use a fallback. That decision removed a fair amount of code and one obstacle to merging.
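An application-side fallback of the kind this decision calls for could be sketched as follows. The helper name fill_random() is hypothetical, and the policy shown (fall back to /dev/urandom, retry on EINTR, loop over short reads) is one reasonable choice among several rather than anything prescribed by glibc.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/random.h>   /* getrandom(); assumes glibc 2.25 or later */
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper.  Fill BUFFER with LENGTH random bytes, using
   getrandom() where the kernel supports it and /dev/urandom otherwise.
   Return 0 on success, or -1 with errno set on failure.  */
int
fill_random (void *buffer, size_t length)
{
  unsigned char *p = buffer;
  int fd = -1;                  /* opened only if getrandom() is missing */
  int result = -1;

  while (length > 0)
    {
      ssize_t n;

      if (fd < 0)
        {
          n = getrandom (p, length, 0);
          if (n < 0 && errno == ENOSYS)
            {
              /* Kernel predates 3.17: switch to the device file.  */
              fd = open ("/dev/urandom", O_RDONLY | O_CLOEXEC);
              if (fd < 0)
                return -1;
              continue;
            }
        }
      else
        n = read (fd, p, length);

      if (n < 0 && errno == EINTR)
        continue;               /* Interrupted by a signal: retry.  */
      if (n < 0)
        goto out;
      if (n == 0)
        {
          errno = EIO;          /* Unexpected end of data.  */
          goto out;
        }
      p += n;
      length -= (size_t) n;
    }
  result = 0;

out:
  if (fd >= 0)
    {
      int saved_errno = errno;  /* close() may clobber errno.  */
      close (fd);
      errno = saved_errno;
    }
  return result;
}
```

Because the descriptor is opened and closed within a single call, this version sidesteps the cached-descriptor problems, at the cost of failing inside a chroot() that lacks /dev/urandom.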
In writing the patch, Weimer worried that there may be a number of applications out there with their own function called getrandom(), which may or may not provide the same interface and semantics as the glibc version. The prospect was especially troubling because a getrandom() call that does not actually return random data may not cause any visible problems in the application at all — until some attacker notices this behavior and exploits it. So he employed a bunch of macro and symbol-versioning trickery to detect and prevent confusion over which getrandom() function to use.
This feature, too, was unpopular; glibc does not normally add extra layers of protection around its symbols in this way. The tricks made it impossible to take the address of the function, among other things. After extensive discussion, Weimer backed down and removed the interposition protection, but he clearly was not entirely happy about it.
The most extensive argument, though, was over whether getrandom() should be a thread cancellation point. In other words, what should happen if pthread_cancel() is called on a thread that is currently blocked in getrandom()? The original patch did make getrandom() into a cancellation point; it still behaves that way in the version merged for 2.25, but it had to survive a lot of argument to get there.
Weimer wanted getrandom() to be a cancellation point because the system call can block indefinitely, even if it almost never blocks at all. The Python os.urandom() episode showed that this blocking can, in rare situations, cause real problems. So, he said, it should be possible for a cancellation-aware program to respond to an overly slow getrandom() call.
The objections here seemed to be, for the most part, objections to cancellation points in general. It is true that cancellation points are problematic in a number of ways. To the implementation issues one can add the fact that most programs are not cancellation-aware and may not respond well to a thread cancellation in an unexpected place. A version of getrandom() that adds a new cancellation point could thus lead to unfortunate behavior. Additionally, getrandom() is supposed to always succeed; the possibility of cancellation adds a failure mode that is not a part of the system call itself.
On the other hand, Carlos O'Donell argued that getrandom() is analogous to read() and thus should behave the same way; read() is a cancellation point. The argument went back and forth over months, and included detours into whether there should be a separate getrandom_nocancel() function or an additional "cancellation point please" argument to getrandom(). In the end, getrandom() remained an unconditional cancellation point. The BSD-compatible getentropy() implementation included in the patch is not a cancellation point, though.
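A program that does not want getrandom() to act as a cancellation point can get that behavior on its own by disabling cancellation around the call. The sketch below uses the standard pthread_setcancelstate() function; the wrapper name my_getrandom_nocancel() is hypothetical and is not the getrandom_nocancel() function that was discussed on the list.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/random.h>   /* getrandom(); assumes glibc 2.25 or later */
#include <sys/types.h>

/* Hypothetical wrapper: call getrandom() with thread cancellation
   disabled, then restore whatever cancel state the caller had.  */
ssize_t
my_getrandom_nocancel (void *buffer, size_t length, unsigned int flags)
{
  int old_state;
  int ignored;

  pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, &old_state);
  ssize_t n = getrandom (buffer, length, flags);
  /* pthread_setcancelstate() reports errors through its return value
     and does not modify errno, so a failure from getrandom() is
     still visible to the caller.  */
  pthread_setcancelstate (old_state, &ignored);
  return n;
}
```

A cancellation request that arrives while cancellation is disabled stays pending and is acted on at the next cancellation point after the state is restored.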
With these issues resolved, the conversation came to a close on
December 12 when getrandom()
and getentropy() were merged into the glibc repository. A
feature that has been shipping in the Linux kernel for over two years will
finally be available to application developers without the need to create
special system-call wrappers. Now all that's left is all the other
Linux-specific system calls that still lack glibc wrappers.
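For reference, the sort of special wrapper that applications have had to carry until now is typically a thin layer over syscall(). The following sketch shows the usual shape; the name my_getrandom() is hypothetical, and the code assumes kernel headers recent enough to define SYS_getrandom.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>  /* SYS_getrandom; assumes 3.17+ kernel headers */
#include <sys/types.h>
#include <unistd.h>       /* syscall() */

/* Hypothetical application-level wrapper.  Return the number of bytes
   written to BUFFER, or -1 with errno set on failure.  Unlike the
   glibc wrapper, this one is not a cancellation point.  */
ssize_t
my_getrandom (void *buffer, size_t length, unsigned int flags)
{
  return (ssize_t) syscall (SYS_getrandom, buffer, length, flags);
}
```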
