udev: use bfq as the default scheduler#13321
Conversation
As requested in https://bugzilla.redhat.com/show_bug.cgi?id=1738828. Test results are that bfq seems to behave better and more consistently on typical hardware. The kernel does not have a configuration option to set the default scheduler, and it currently needs to be set by userspace. See the bug for more discussion and links.
The rule seems to miss mmc devices, which benefit a lot from BFQ. For example, BFQ is the new default I/O scheduler in Chromebooks running chromeos-4.19 kernels. The rule doesn't consider nvme devices either, but, in this respect, I'd agree with testing this change on single-queue devices first.
What about stuff like /dev/loop* that writes to an underlying device, and virtualized stuff like /dev/vd*, /dev/xvd* or networked like /dev/nbd*?

IIRC, all of these have only none as the available I/O scheduler.
… Il giorno 16 ago 2019, alle ore 15:35, Zbigniew Jędrzejewski-Szmek ***@***.***> ha scritto:
What about stuff like /dev/loop* that writes to an underlying devices, and virtualized stuff like /dev/vd*, /dev/xvd* or networked like /dev/nbd* ?
$ grep . /sys/class/block/*/queue/scheduler
/sys/class/block/dm-0/queue/scheduler:none
/sys/class/block/dm-1/queue/scheduler:none
/sys/class/block/loop0/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/loop1/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/loop2/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd0/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd1/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd2/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/sda/queue/scheduler:mq-deadline kyber [bfq] none
/sys/class/block/sdb/queue/scheduler:mq-deadline kyber [bfq] none
$ uname -r
5.2.7-200.fc30.x86_64
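In this sysfs listing, the bracketed entry on each line is the scheduler currently in effect for that queue. As a side note, pulling that entry out is plain string handling; a minimal sketch (the helper name and sample lines are illustrative, not part of the patch):

```shell
#!/bin/sh
# Extract the active scheduler (the bracketed entry) from a line in the
# format of /sys/class/block/<dev>/queue/scheduler. Pure text handling,
# so it runs without a real sysfs.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

active_scheduler "mq-deadline kyber [bfq] none"    # prints: bfq
active_scheduler "[mq-deadline] kyber bfq none"    # prints: mq-deadline
```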
I didn't remember well. I thought this over a little bit. Most
throughput-boosting heuristics of BFQ are for physical block devices,
and would be worthless with virtual devices. But BFQ would still
guarantee very low latency even for the I/O done on a virtual or
network device, and would make it possible to control such I/O
with weights (through the cgroups interface). So, yes, switching to
BFQ for these devices too would be beneficial. Thanks for checking
it.
Paolo
… Il giorno 17 ago 2019, alle ore 15:32, Zbigniew Jędrzejewski-Szmek ***@***.***> ha scritto:
$ grep . /sys/class/block/*/queue/scheduler
/sys/class/block/dm-0/queue/scheduler:none
/sys/class/block/dm-1/queue/scheduler:none
/sys/class/block/loop0/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/loop1/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/loop2/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd0/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd1/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd2/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/sda/queue/scheduler:mq-deadline kyber [bfq] none
/sys/class/block/sdb/queue/scheduler:mq-deadline kyber [bfq] none
$
uname -r
5.2.7-200.fc30.x86_64
(that's with this patch, which overrides the scheduler for sr* and sd* only).
So I am not sure how I feel about this. So far the sysctls we changed from the kernel defaults in systemd are pretty much in the territory of "we define our own execution environment, we can decide the semantics how it behaves". But selecting the IO scheduler sounds more like "tuning", i.e. everything works basically the same, there's no change in semantics, it just performs a bit better, and I am not sure we should be in the business of doing performance evaluation and picking what is best by some standards we don't understand... I mean, is bfq really universally better? Is it that clear cut? If it is, why isn't the kernel changing the defaults anyway?

I mean, I can see why the kernel wants to be a bit more conservative with settings such as fs.protected_hardlinks, because it breaks compat with some cases. We can be more aggressive there in systemd, but the selection of IO schedulers doesn't break compat, it just tweaks behaviour, afaics, so why not leave this to the kernel maintainers to change? Or at least leave it to your specific distro's maintainers to switch? Is there a kernel build-time option to pick the default IO scheduler?

There's also the problem that bfq breaks CPUWeight= currently, i.e. #13335. I am not sure we should merge a patch that trades a tiny bit of improvement against breakage of pretty relevant functionality just like that.

@paolo-github you are the bfq guy, right? Can you comment about this, and on #13335 please?
@poettering I think this should've read IOWeight=.
Setting bfq as default via udev rules is problematic because it may not be available. What about checking it before setting? It depends on whether the kernel defaults to using multi-queue or single-queue mode for device drivers, and even then some drivers may only support one or the other. bfq is not officially available for single-queue device drivers. This should really be a distro / maintainer choice.

The kernel itself doesn't support a default MQ scheduler, only SQ had a default choice.
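The availability concern raised here could be handled by probing the sysfs file before writing to it; a minimal sketch, assuming the usual sysfs layout (the helper name is illustrative, and the demo uses a stand-in temp file so it runs anywhere):

```shell
#!/bin/sh
# Switch a device's queue to bfq only if the kernel lists bfq as an
# available scheduler for it. On a live system the file would be e.g.
# /sys/class/block/sda/queue/scheduler; here a temp file stands in.
set_bfq_if_available() {
    sched_file=$1
    if [ -r "$sched_file" ] && grep -qw bfq "$sched_file"; then
        echo bfq > "$sched_file"
        return 0
    fi
    return 1
}

# Demo against a stand-in file mimicking a sysfs scheduler listing.
tmp=$(mktemp)
echo 'mq-deadline kyber bfq none' > "$tmp"
set_bfq_if_available "$tmp" && cat "$tmp"    # prints: bfq
rm -f "$tmp"
```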
Il giorno 19 ago 2019, alle ore 11:46, Lennart Poettering ***@***.***> ha scritto:
So I am not sure how I feel about this. So far the sysctls we changed from the kernel defaults in systemd are pretty much in the territory of "we define our own execution environment, we can decide the semantics how it behaves". But selecting the IO scheduler sounds more like "tuning", i.e. everything works basically the same, there's no change in semantics, it just performs a bit better, and I am not sure we should be in the business of doing performance evaluation and picking what is best by some standards we don't understand... I mean, is bfq really universally better? Is it that clear cut? If it is, why isn't the kernel changing the defaults anyway?
Because they say it's userspace that must do it. In short, they say
this must be a per-device choice, and the kernel doesn't do that.
We fought a hard fight on this. The above argument is flawed because
there is *always* a default, especially if an option with that name is
no longer available (and the actual, hardwired default is
mq-deadline).
The presence of such a default rightly triggers the following,
systematic question from user-space people: if bfq is better, why is
the default different? And this closes the infinite loop.
(I'd prefer not to waste time looking again for all the involved
threads, and linking them here. But, if needed, I'll do that too.)
I mean, I can see why the kernel wants to be a bit more conservative with settings such as fs.protected_hardlinks, because it breaks compat with some cases. We can be more aggressive there in systemd, but the selection of IO schedulers doesn't break compat, it just tweaks behaviour, afaics, so why not leave this to the kernel maintainers to change?
They don't want to change it. They leave mq-deadline because it is
supposed to be better suited for the niche of enterprise storage doing
millions of IOPS. Paradoxically, with millions of IOPS, the very
in-kernel I/O handling is often too heavy; even with no I/O
scheduling. The most recent evidence of this is the new io_uring
effort.
However, this argument didn't make maintainers change their mind either. One
of the main practical reasons is that very few, if any, user-space
people show up in these discussions (for many good reasons, I know).
So it's basically our numbers (test results) against authority.
Or at least leave it to your specific distro's maintainers to switch?
That's the main path I'm following right now. The problem is that I'm
only one, and convincing every distro (even just the most used ones)
is proving to be very time consuming, slow and inefficient.
Then Zbigniew made this pull request, which may actually speed up things
incredibly.
Is there a kernel build-time option to pick the default IO scheduler?
Removed by the block-layer maintainer, with the motivations I reported
above.
There's also the problem that bfq breaks CPUWeight= currently, i.e. #13335.
Hopefully no more. I'll reply on that thread.
I am not sure we should merge a patch that trades a tiny bit of improvement against breakage of pretty relevant functionality just like that.
No more breakage ahead. As for the 'tiny' improvements, these are
some of the benefits provided by BFQ, on any type of storage medium
(embedded flash storage, HDDs, SATA or NVMe SSDs, ...) and on systems
ranging from minimal embedded systems to high-end servers:
- Under load, BFQ loads applications up to 20X as fast as any
other I/O scheduler. In absolute terms, the system is virtually as
responsive as if it was idle, regardless of the background I/O
workload. As a concrete example, with writes as background workload
on a Samsung SSD 970 PRO, gnome-terminal starts in 1.8 seconds with
BFQ, and in at least 28.7 seconds with the other I/O schedulers [1].
- Soft real-time applications, such as audio and video players or
audio- and video-streaming applications, enjoy smooth playback or
streaming, regardless of the background I/O workload [1].
- In multi-client applications---i.e., when multiple clients, groups,
containers, virtual machines or any other kind of entities compete
for a shared medium---BFQ reaches from 5X to 10X higher throughput
than any other solution for guaranteeing bandwidth to each entity
competing for storage [2].
In addition, BFQ reaches up to 2X higher throughput than the other I/O
schedulers on slow devices, and guarantees high throughput and
responsiveness with code-development tasks. Links to demos and, in
general, more details on BFQ's homepage [3].
[1] https://algo.ing.unimo.it/people/paolo/BFQ/results.php
[2] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/
[3] https://algo.ing.unimo.it/people/paolo/BFQ
@paolo-github you are the bfq guy, right? Can you comment about this, and on #13335 please?
I think you've already guessed :)
Thanks,
Paolo
Il giorno 19 ago 2019, alle ore 13:04, Kai Krakow ***@***.***> ha scritto:
Setting bfq as default via udev rules is problematic because it may not be available.
What about checking it before setting?
It depends on whether the kernel defaults to using multi-queue or single-queue mode for device drivers,
single-queue is no more, since 5.0 :)
and even then some drivers may only support one or the other.
all drivers now necessarily support multi-queue, as single-queue doesn't exist any longer
bfq is not officially available for single-queue device drivers. This should really be a distro / maintainer choice.
The kernel itself doesn't support a default MQ scheduler, only SQ had a default choice.
Yep. The default has been removed in MQ. I've written the motivation in my previous comment.
Paolo
Ah okay, I wasn't aware it had been removed in 5.0; some of my machines still run 4.19 stable. Thanks.
yes, of course, sorry for the confusion
Well, the kernel does this for network queuing disciplines (i.e. some devices are marked by their drivers as needing a fifo scheduler, and then get a different default than others), I really don't see why this wouldn't work for IO schedulers in a similar fashion too. The networking stack supports picking a default scheduler via the net.core.default_qdisc sysctl. The solution the networking people picked is nicely race-free, as the defaults are picked by userspace beforehand with a sysctl or beforehand by the kernel drivers, but not when it's already too late by running userspace after the fact.

I mean, this sounds like a stupid game of passing around responsibility for this stuff. But instead of implementing the default policy where it's simple and obvious (i.e. the kernel), and only doing the non-default, per-installation tweaks in userspace, kernel folks now can't agree with themselves and just pass the responsibility to userspace wholesale. I mean, we as systemd people have not much clue about elevators, I am not sure we really should be the ones making the decisions here about what a good default is... It's like you come to a new city unknown to you, finding a taxi that shall bring you to your destination, but then the taxi driver asks you to lead the way, and you got no map...

I mean, I think doing policy in userspace is great if it actually involves stuff that userspace can do better, or that is per-installation tuning. But for frickin' defaults, that apply everywhere, where all we do is echo some essentially constant nonsense back into the kernel when the kernel asks for it, why? just why?

@axboe any chance you can reconsider this? Why dump such default policy choices on the doorstep of userspace? Why pick defaults that are apparently wrong for most relevant usecases and expect userspace to clean up after you?

I mean, any solution involving udev is pretty ugly in general, because it means we always start the block devices with the wrong scheduler and then swap it out after some initial IO was already done on the device. Who does stuff like that? It's just plain ugly.

gah, this all sounds like fragile, hacky, racy garbage that works around a social problem (kernel folks not being able to come to an agreement between themselves) and expectations that userspace is the trash dump for everything the kernel people want to avoid figuring out. Seriously, this all should just work with a naked kernel, and userspace should not be involved in picking defaults.
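For comparison, the race-free network-side mechanism referred to here is a plain sysctl, which distributions can ship as a sysctl.d drop-in so that devices created later pick it up from the start, with no after-the-fact rewriting by udev (the file path below is illustrative):

```
# /etc/sysctl.d/50-default-qdisc.conf (illustrative path)
# Network devices created after this is applied default to the fq qdisc.
net.core.default_qdisc = fq
```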
Quite frankly, if there's something that an OS kernel should nicely abstract away from userspace, it's the IO scheduler... I really don't want to be included in discussions about which scheduler is better, and certainly don't want to be the one making the decisions on this, as I really have no clue about IO schedulers. I mean, I am pretty sure you don't want my input on IO schedulers; it's not going to be much else than "I don't know". (Also, I am very conservative about taking part in lkml discussions in general, as it tends to result in tons of hate mail flowing my way, hence if you add me on something that is also cc'ed to lkml I tune out right away, because I really don't need that; the toxicity of the kernel community (in particular the fringes of it) in this regard they can keep for themselves.)
Il giorno 19 ago 2019, alle ore 16:41, Lennart Poettering ***@***.***> ha scritto:
One of the main practical reasons is that very few, if any, user-space people show up in these discussions (for many good reasons, I know).
Quite frankly, if there's something that an OS kernel should nicely abstract away from userspace it's an IO scheduler... I really don't want to be included in discussions about which scheduler is better, and certainly don't want to be the one making the decisions on this, as I really have no clue about IO schedulers. I mean, I am pretty sure you don't want my input on IO schedulers, it's not going to be much else than "I don't know".
Your "I don't know" is my exact claim against the "userspace knows better" argument.
(Also I am very conservative with taking part on lkml discussions in general, it tends to result in tons of hate mail flowing my way as effect, hence if you add me on something that is also cc'ed to lkml I tune out right-away, because I really don't need that, the toxicity of the kernel community (in particular the fringes of it) in this regard they can keep for themselves).
Yep. After you expressed this concern many months ago, I stopped CCing you.
Thanks,
Paolo
Review comment on the new rules file:

@@ -0,0 +1,3 @@
# do not edit this file, it will be overwritten on update

ACTION=="add|change", KERNEL=="sd*[!0-9]|sr*", ATTR{queue/scheduler}="bfq"

there should be a comment here I figure, explaining the situation briefly. (And I'd really clarify that this is a technical solution for a political problem in the kernel community and that we believe this shouldn't be here.)
I think this should also carry a SUBSYSTEM=="block" check, no?
Also, needs a NEWS entry.

Duh, I moved this to a separate file and forgot to copy all the "headers". So this would need to get fixed.
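Taken together, the review comments suggest the rules file might end up looking roughly like this (a sketch only, not the merged file; the wording of the explanatory comment is an assumption):

```
# do not edit this file, it will be overwritten on update

# Select bfq as the I/O scheduler for SCSI/SATA disks and optical drives.
# The kernel no longer offers a way to pick a default multi-queue
# scheduler, so userspace has to set it per device.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*[!0-9]|sr*", ATTR{queue/scheduler}="bfq"
```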
I'll make this brief. I'm not interested in making BFQ the default scheduler, as the default scheduler needs to be both simple and 100% stable, and BFQ is neither. There's still too much churn, and silly bugs end up getting introduced and fixed. Maybe that'll change with time as confidence and stability grow, but right now is not that time.

I'm fine with having a way to set a scheduler in the kernel for cases that absolutely need it. That includes devices like SMR/zoned devices, which are only supported by specific schedulers. Those should be loaded up with a compatible scheduler in the kernel, and not be punted to userspace.
@axboe ah, interesting. So you say the prospect is that eventually the take on bfq would change, and it could become the default choice in upstream Linux kernels without any userspace interference, after all bugs are fixed and it has proven itself? So what's your opinion on downstream distributions (in particular Fedora) adopting it now? Too early? Or great, to get the testing needed to make it the default? I.e. if we merge this patch now, is this something that has a clear perspective of eventually becoming unnecessary? (Merging something that is a stopgap, where adding it helps make itself unnecessary, makes this much more attractive to me.)
I wouldn't merge this patch now, because of the reasons I outlined; it really doesn't matter whether it's systemd or the kernel making the same choice, the result is the same in the end. But that's totally up to you. And yes, it could change over time; I don't have a crystal ball and can't foresee how that will go :-)

As far as distros go, that's 100% up to them as well. I've stated my opinion on the matter; they are free to proceed as they wish, as they are the ones handling the support in the end. More users is definitely a win for BFQ and will help to shake out issues and increase confidence in it.
@axboe ok, thank you very much for your input. Having now heard opposing opinions from various folks makes me sure it shouldn't be the systemd folks who decide on this though... it's not clear cut at all, and should primarily be a distro choice, and nothing we push for in upstream systemd.
Il giorno 19 ago 2019, alle ore 17:00, Jens Axboe ***@***.***> ha scritto:
I wouldn't merge this patch now because of the reasons I outlined, it really doesn't matter if it's systemd or the kernel making the same choice, the result is the same in the end. But that's totally up to you, And yes, it could change over time, I don't have a crystal ball and can't foresee how that will go :-)
Although not so relevant to this discussion, as a BFQ developer I feel compelled to add only that:
1) BFQ's performance is already confirmed by ten+ years of good results on an ever-growing set of easily repeatable tests;
2) BFQ has suffered from a few bugs recently because we made a lot of non-trivial improvements.
After reading all the pros and cons, I think it is better if we make the decision downstream. In particular, Fedora has a much narrower range of supported kernels (e.g. right now the oldest we have is 5.2.7 in F29), and the decision about the scheduler is strongly influenced by the kernel version. The current version of systemd tries to support kernels >= 3.13. I'll put this in systemd in F31+.
... and thank you all for input. It's very much appreciated.