udev: use bfq as the default scheduler#13321
Conversation
As requested in https://bugzilla.redhat.com/show_bug.cgi?id=1738828. Test results are that bfq seems to behave better and more consistently on typical hardware. The kernel does not have a configuration option to set the default scheduler, and it currently needs to be set by userspace. See the bug for more discussion and links.
The rule seems to miss mmc devices, which benefit a lot from BFQ. For example, BFQ is the new default I/O scheduler in Chromebooks running chromeos-4.19 kernels. The rule doesn't consider nvme devices either, but, in this respect, I'd agree with testing this change on single-queue devices first.
What about stuff like /dev/loop* that writes to an underlying device, and virtualized stuff like /dev/vd*, /dev/xvd* or networked like /dev/nbd*?

IIRC, all of these have only none as the available I/O scheduler.
… Il giorno 16 ago 2019, alle ore 15:35, Zbigniew Jędrzejewski-Szmek ***@***.***> ha scritto:
What about stuff like /dev/loop* that writes to an underlying devices, and virtualized stuff like /dev/vd*, /dev/xvd* or networked like /dev/nbd* ?
$ grep . /sys/class/block/*/queue/scheduler
/sys/class/block/dm-0/queue/scheduler:none
/sys/class/block/dm-1/queue/scheduler:none
/sys/class/block/loop0/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/loop1/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/loop2/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd0/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd1/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd2/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/sda/queue/scheduler:mq-deadline kyber [bfq] none
/sys/class/block/sdb/queue/scheduler:mq-deadline kyber [bfq] none
$ uname -r
5.2.7-200.fc30.x86_64
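In this sysfs listing, the bracketed entry on each line is the scheduler currently in effect for that queue. As a side note, pulling that entry out is plain string handling; a minimal sketch (the helper name and sample lines are illustrative, not part of the patch):

```shell
#!/bin/sh
# Extract the active scheduler (the bracketed entry) from a line in the
# format of /sys/class/block/<dev>/queue/scheduler. Pure text handling,
# so it runs without a real sysfs.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

active_scheduler "mq-deadline kyber [bfq] none"    # prints: bfq
active_scheduler "[mq-deadline] kyber bfq none"    # prints: mq-deadline
```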
I didn't remember well. I thought this over a little bit. Most
throughput-boosting heuristics of BFQ are for physical block devices,
and would be worthless with virtual devices. But BFQ would still
guarantee very low latency even for the I/O done on a virtual or
network device, and would make it possible to control such I/O
with weights (through the cgroups interface). So, yes, switching to
BFQ for these devices too would be beneficial. Thanks for checking
it.
Paolo
… Il giorno 17 ago 2019, alle ore 15:32, Zbigniew Jędrzejewski-Szmek ***@***.***> ha scritto:
$ grep . /sys/class/block/*/queue/scheduler
/sys/class/block/dm-0/queue/scheduler:none
/sys/class/block/dm-1/queue/scheduler:none
/sys/class/block/loop0/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/loop1/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/loop2/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd0/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd1/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd2/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/sda/queue/scheduler:mq-deadline kyber [bfq] none
/sys/class/block/sdb/queue/scheduler:mq-deadline kyber [bfq] none
$
uname -r
5.2.7-200.fc30.x86_64
(that's with this patch, which overrides the scheduler for sr* and sd* only).
So I am not sure how I feel about this. So far the sysctls we changed from the kernel defaults in systemd are pretty much in the territory of "we define our own execution environment, we can decide the semantics how it behaves". But selecting the IO scheduler sounds more like "tuning", i.e. everything works basically the same, there's no change in semantics, it just performs a bit better, and I am not sure we should be in the business of doing performance evaluation and picking what is best by some standards we don't understand... I mean, is bfq really universally better? Is it that clear cut? If it is, why isn't the kernel changing the defaults anyway?

I mean, I can see why the kernel wants to be a bit more conservative with settings such as fs.protected_hardlinks, because it breaks compat with some cases. We can be more aggressive there in systemd, but the selection of IO schedulers doesn't break compat, it just tweaks behaviour, afaics, so why not leave this to the kernel maintainers to change? Or at least leave it to your specific distro's maintainers to switch? Is there a kernel build-time option to pick the default IO scheduler?

There's also the problem that bfq breaks CPUWeight= currently, i.e. #13335. I am not sure we should merge a patch that trades a tiny bit of improvement against breakage of pretty relevant functionality just like that.

@paolo-github you are the bfq guy, right? Can you comment about this, and on #13335 please?
@poettering I think this should've read IOWeight=.
Setting bfq as default via udev rules is problematic because it may not be available. What about checking it before setting? It depends on whether the kernel defaults to using multi-queue or single-queue mode for device drivers, and even then some drivers may only support one or the other. bfq is not officially available for single-queue device drivers. This should really be a distro / maintainer choice.

The kernel itself doesn't support a default MQ scheduler, only SQ had a default choice.
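The availability concern raised here could be handled by probing the sysfs file before writing to it; a minimal sketch, assuming the usual sysfs layout (the helper name is illustrative, and the demo uses a stand-in temp file so it runs anywhere):

```shell
#!/bin/sh
# Switch a device's queue to bfq only if the kernel lists bfq as an
# available scheduler for it. On a live system the file would be e.g.
# /sys/class/block/sda/queue/scheduler; here a temp file stands in.
set_bfq_if_available() {
    sched_file=$1
    if [ -r "$sched_file" ] && grep -qw bfq "$sched_file"; then
        echo bfq > "$sched_file"
        return 0
    fi
    return 1
}

# Demo against a stand-in file mimicking a sysfs scheduler listing.
tmp=$(mktemp)
echo 'mq-deadline kyber bfq none' > "$tmp"
set_bfq_if_available "$tmp" && cat "$tmp"    # prints: bfq
rm -f "$tmp"
```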
Il giorno 19 ago 2019, alle ore 11:46, Lennart Poettering ***@***.***> ha scritto:
So I am not sure how I feel about this. So far the sysctls we changed from the kernel defaults in systemd are pretty much in the territory of "we define our own execution environment, we can decide the semantics how it behaves". But selecting the IO scheduler sounds more like "tuning", i.e. everything works basically the same, there's no change in semantics, it just performs a bit better, and I am not sure we should be in the business of doing performance evaluation and picking what is best by some standards we don't understand... I mean, is bfq really universally better? Is it that clear cut? If it is, why isn't the kernel changing the defaults anyway?
Because they say it's userspace that must do it. In short, they say
this must be a per-device choice, and the kernel doesn't do that.
We fought a hard fight on this. The above argument is flawed because
there is *always* a default, especially if an option with that name is
no longer available (and the actual, hardwired default is
mq-deadline).
The presence of such a default rightly triggers the following,
systematic question from user-space people: if bfq is better, why is
the default different? And this closes the infinite loop.
(I'd prefer not to waste time looking again for all the involved
threads, and linking them here. But, if needed, I'll do that too.)
I mean, I can see why the kernel wants to be a bit more conservative with settings such as fs.protected_hardlinks, because it breaks compat with some cases. We can be more aggressive there in systemd, but the selection of IO schedulers doesn't break compat, it just tweaks behaviour, afaics, so why not leave this to the kernel maintainers to change?
They don't want to change it. They leave mq-deadline because it is
supposed to be better suited for the niche of enterprise storage doing
millions of IOPS. Paradoxically, with millions of IOPS, the very
in-kernel I/O handling is often too heavy; even with no I/O
scheduling. The most recent evidence of this is the new io_uring
effort.
However, this argument didn't make maintainers change their mind either. One
of the main practical reasons is that very few, if any, user-space
people show up in these discussions (for many good reasons, I know).
So it's basically our numbers (test results) against authority.
Or at least leave it to your specific distro's maintainers to switch?
That's the main path I'm following right now. The problem is that I'm
only one, and convincing every distro (even just the most used ones)
is proving to be very time consuming, slow and inefficient.
Then Zbigniew made this pull request, which may actually speed up things
incredibly.
Is there a kernel build-time option to pick the default IO scheduler?
Removed by the block-layer maintainer, with the motivations I reported
above.
There's also the problem that bfq breaks CPUWeight= currently, i.e. #13335.
Hopefully no more. I'll reply on that thread.
I am not sure we should merge a patch that trades a tiny bit of improvement against breakage of pretty relevant functionality just like that.
No more breakage ahead. As for the 'tiny' improvements, these are
some of the benefits provided by BFQ, on any type of storage medium
(embedded flash storage, HDDs, SATA or NVMe SSDs, ...) and on systems
ranging from minimal embedded systems to high-end servers:
- Under load, BFQ loads applications up to 20X as fast as any
other I/O scheduler. In absolute terms, the system is virtually as
responsive as if it was idle, regardless of the background I/O
workload. As a concrete example, with writes as background workload
on a Samsung SSD 970 PRO, gnome-terminal starts in 1.8 seconds with
BFQ, and in at least 28.7 seconds with the other I/O schedulers [1].
- Soft real-time applications, such as audio and video players or
audio- and video-streaming applications, enjoy smooth playback or
streaming, regardless of the background I/O workload [1].
- In multi-client applications---i.e., when multiple clients, groups,
containers, virtual machines or any other kind of entities compete
for a shared medium---BFQ reaches from 5X to 10X higher throughput
than any other solution for guaranteeing bandwidth to each entity
competing for storage [2].
In addition, BFQ reaches up to 2X higher throughput than the other I/O
schedulers on slow devices, and guarantees high throughput and
responsiveness with code-development tasks. Links to demos and, in
general, more details on BFQ's homepage [3].
[1] https://algo.ing.unimo.it/people/paolo/BFQ/results.php
[2] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/
[3] https://algo.ing.unimo.it/people/paolo/BFQ
@paolo-github you are the bfq guy, right? Can you comment about this, and on #13335 please?
I think you've already guessed :)
Thanks,
Paolo
Il giorno 19 ago 2019, alle ore 13:04, Kai Krakow ***@***.***> ha scritto:
Setting bfq as default via udev rules is problematic because it may not be available.
What about checking it before setting?
It depends on whether the kernel defaults to using multi-queue or single-queue mode for device drivers,
single-queue is no more, since 5.0 :)
and even then some drivers may only support one or the other.
all drivers now necessarily support multi-queue, as single-queue doesn't exist any longer
bfq is not officially available for single-queue device drivers. This should really be a distro / maintainer choice.
The kernel itself doesn't support a default MQ scheduler, only SQ had a default choice.
Yep. The default has been removed in MQ. I've written the motivation in my previous comment.
Paolo
Ah okay, I wasn't aware it had been removed in 5.0; some of my machines still run 4.19 stable. Thanks.
yes, of course, sorry for the confusion
Well, the kernel does this for network queuing disciplines (i.e. some devices are marked by their drivers as needing a fifo scheduler, and then get a different default than others), I really don't see why this wouldn't work for IO schedulers in a similar fashion too. The networking stack supports picking a default scheduler via the net.core.default_qdisc sysctl. The solution the networking people picked is nicely race-free, as the defaults are picked by userspace beforehand with a sysctl or beforehand by the kernel drivers, but not when it's already too late by running userspace after the fact.

I mean, this sounds like a stupid game of passing around responsibility for this stuff. But instead of implementing the default policy where it's simple and obvious (i.e. the kernel), and only doing the non-default, per-installation tweaks in userspace, kernel folks now can't agree with themselves and just pass the responsibility to userspace wholesale. I mean, we as systemd people have not much clue about elevators, I am not sure we really should be the ones making the decisions here about what a good default is... It's like you come to a new city unknown to you, finding a taxi that shall bring you to your destination, but then the taxi driver asks you to lead the way, and you got no map...

I mean, I think doing policy in userspace is great if it actually involves stuff that userspace can do better, or that is per-installation tuning. But for frickin' defaults, that apply everywhere, where all we do is echo some essentially constant nonsense back into the kernel when the kernel asks for it, why? just why?

@axboe any chance you can reconsider this? Why dump such default policy choices on the doorstep of userspace? Why pick defaults that are apparently wrong for most relevant usecases and expect userspace to clean up after you?

I mean, any solution involving udev is pretty ugly in general, because it means we always start the block devices with the wrong scheduler and then swap it out after some initial IO was already done on the device. Who does stuff like that? It's just plain ugly.

gah, this all sounds like fragile, hacky, racy garbage that works around a social problem (kernel folks not being able to come to an agreement between themselves) and expectations that userspace is the trash dump for everything the kernel people want to avoid figuring out. Seriously, this all should just work with a naked kernel, and userspace should not be involved in picking defaults.
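For comparison, the race-free network-side mechanism referred to here is a plain sysctl, which distributions can ship as a sysctl.d drop-in so that devices created later pick it up from the start, with no after-the-fact rewriting by udev (the file path below is illustrative):

```
# /etc/sysctl.d/50-default-qdisc.conf (illustrative path)
# Network devices created after this is applied default to the fq qdisc.
net.core.default_qdisc = fq
```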
Quite frankly, if there's something that an OS kernel should nicely abstract away from userspace, it's the IO scheduler... I really don't want to be included in discussions about which scheduler is better, and certainly don't want to be the one making the decisions on this, as I really have no clue about IO schedulers. I mean, I am pretty sure you don't want my input on IO schedulers; it's not going to be much else than "I don't know". (Also, I am very conservative about taking part in lkml discussions in general, as it tends to result in tons of hate mail flowing my way, hence if you add me on something that is also cc'ed to lkml I tune out right away, because I really don't need that; the toxicity of the kernel community (in particular the fringes of it) in this regard they can keep for themselves.)
Il giorno 19 ago 2019, alle ore 16:41, Lennart Poettering ***@***.***> ha scritto:
One of the main practical reasons is that very few, if any, user-space people show up in these discussions (for many good reasons, I know).
Quite frankly, if there's something that an OS kernel should nicely abstract away from userspace it's an IO scheduler... I really don't want to be included in discussions about which scheduler is better, and certainly don't want to be the one making the decisions on this, as I really have no clue about IO schedulers. I mean, I am pretty sure you don't want my input on IO schedulers, it's not going to be much else than "I don't know".
Your "I don't know" is my exact claim against the "userspace knows better" argument.
(Also I am very conservative with taking part on lkml discussions in general, it tends to result in tons of hate mail flowing my way as effect, hence if you add me on something that is also cc'ed to lkml I tune out right-away, because I really don't need that, the toxicity of the kernel community (in particular the fringes of it) in this regard they can keep for themselves).
Yep. After you expressed this concern many months ago, I stopped CCing you.
Thanks,
Paolo
Review comment on the new rules file:

@@ -0,0 +1,3 @@
# do not edit this file, it will be overwritten on update

ACTION=="add|change", KERNEL=="sd*[!0-9]|sr*", ATTR{queue/scheduler}="bfq"

there should be a comment here I figure, explaining the situation briefly. (And I'd really clarify that this is a technical solution for a political problem in the kernel community and that we believe this shouldn't be here.)
I think this should also carry a SUBSYSTEM=="block" check, no?
Also, needs a NEWS entry.

Duh, I moved this to a separate file and forgot to copy all the "headers". So this would need to get fixed.
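Taken together, the review comments suggest the rules file might end up looking roughly like this (a sketch only, not the merged file; the wording of the explanatory comment is an assumption):

```
# do not edit this file, it will be overwritten on update

# Select bfq as the I/O scheduler for SCSI/SATA disks and optical drives.
# The kernel no longer offers a way to pick a default multi-queue
# scheduler, so userspace has to set it per device.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*[!0-9]|sr*", ATTR{queue/scheduler}="bfq"
```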
I'll make this brief. I'm not interested in making BFQ the default scheduler, as the default scheduler needs to be both simple and 100% stable, and BFQ is neither. There's still too much churn, and silly bugs end up getting introduced and fixed. Maybe that'll change with time as confidence and stability grow, but right now is not that time.

I'm fine with having a way to set a scheduler in the kernel for cases that absolutely need it. That includes devices like SMR/zoned devices, which are only supported by specific schedulers. Those should be loaded up with a compatible scheduler in the kernel, and not be punted to userspace.
@axboe ah, interesting. So you say the prospect is that eventually the take on bfq would change, and it could become the default choice in upstream Linux kernels without any userspace interference, after all bugs are fixed and it has proven itself? So what's your opinion on downstream distributions (in particular Fedora) adopting it now? Too early? Or great, to get the testing needed to make it the default? I.e. if we merge this patch now, is this something that has a clear perspective of eventually becoming unnecessary? (Merging something that is a stopgap, where adding it helps make itself unnecessary, makes this much more attractive to me.)
I wouldn't merge this patch now, because of the reasons I outlined; it really doesn't matter whether it's systemd or the kernel making the same choice, the result is the same in the end. But that's totally up to you. And yes, it could change over time; I don't have a crystal ball and can't foresee how that will go :-)

As far as distros go, that's 100% up to them as well. I've stated my opinion on the matter; they are free to proceed as they wish, as they are the ones handling the support in the end. More users is definitely a win for BFQ and will help to shake out issues and increase confidence in it.
@axboe ok, thank you very much for your input. Having now heard opposing opinions from various folks makes me sure it shouldn't be the systemd folks who decide on this though... it's not clear cut at all, and should primarily be a distro choice, and nothing we push for in upstream systemd.
Il giorno 19 ago 2019, alle ore 17:00, Jens Axboe ***@***.***> ha scritto:
I wouldn't merge this patch now because of the reasons I outlined, it really doesn't matter if it's systemd or the kernel making the same choice, the result is the same in the end. But that's totally up to you, And yes, it could change over time, I don't have a crystal ball and can't foresee how that will go :-)
Although not so relevant to this discussion, as a BFQ developer I feel compelled to add only that:
1) BFQ's performance is already confirmed by ten+ years of good results on an ever-growing set of easily repeatable tests;
2) BFQ has suffered from a few bugs recently because we made a lot of non-trivial improvements.
After reading all the pros and cons, I think it is better if we make the decision downstream. In particular, Fedora has a much narrower range of supported kernels (e.g. right now the oldest we have is 5.2.7 in F29), and the decision about the scheduler is strongly influenced by the kernel version. The current version of systemd tries to support kernels >= 3.13. I'll put this in systemd in F31+.
... and thank you all for input. It's very much appreciated.