
udev: use bfq as the default scheduler #13321

Closed
keszybz wants to merge 1 commit into systemd:master from keszybz:bfq

Conversation

@keszybz
Member

@keszybz keszybz commented Aug 14, 2019

As requested in https://bugzilla.redhat.com/show_bug.cgi?id=1738828.
Test results are that bfq seems to behave better and more consistently on
typical hardware. The kernel does not have a configuration option to set
the default scheduler, and it currently needs to be set by userspace.

See the bug for more discussion and links.

@keszybz keszybz added the udev label Aug 14, 2019
@paolo-github

The rule seems to miss mmc devices. The latter benefit a lot from BFQ. For example, BFQ is the new default IO scheduler in Chromebooks running chromeos-4.19 kernels.

The rule doesn't consider nvme devices either, but, in this respect, I'd agree with testing this change with only single-queue devices first.

@keszybz
Member Author

keszybz commented Aug 16, 2019

What about stuff like /dev/loop* that writes to an underlying device, and virtualized stuff like /dev/vd*, /dev/xvd*, or networked devices like /dev/nbd*?
Does it need a scheduler at all or would scheduler=none be better?

@Algodev-github

Algodev-github commented Aug 16, 2019 via email

@keszybz
Member Author

keszybz commented Aug 17, 2019

$ grep . /sys/class/block/*/queue/scheduler
/sys/class/block/dm-0/queue/scheduler:none
/sys/class/block/dm-1/queue/scheduler:none
/sys/class/block/loop0/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/loop1/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/loop2/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd0/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd1/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/nbd2/queue/scheduler:[mq-deadline] kyber bfq none
/sys/class/block/sda/queue/scheduler:mq-deadline kyber [bfq] none
/sys/class/block/sdb/queue/scheduler:mq-deadline kyber [bfq] none
$ uname -r
5.2.7-200.fc30.x86_64

(that's with this patch, which overrides the scheduler for sr* and sd* only).
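As an aside for readers of the listing above: the bracketed entry on each line marks the currently active scheduler. A small sketch for pulling it out (the `active_scheduler` helper is hypothetical, not part of the patch):

```shell
# Hypothetical helper (not part of the patch): print the active scheduler,
# i.e. the bracketed token, from a /sys/class/block/*/queue/scheduler line.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

active_scheduler "mq-deadline kyber [bfq] none"    # prints: bfq

# At runtime the scheduler can be switched per device (needs root), which is
# what the udev rule automates on "add|change" events:
#   echo bfq > /sys/class/block/sda/queue/scheduler
```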

@Algodev-github

Algodev-github commented Aug 19, 2019 via email

@poettering
Member

So I am not sure how I feel about this. So far the sysctls we changed from the kernel defaults in systemd are pretty much in the territory of "we define our own execution environment, we can decide the semantics of how it behaves". But selecting the IO scheduler sounds more like "tuning", i.e. everything works basically the same, there's no change in semantics, it just performs a bit better, and I am not sure we should be in the business of doing performance evaluation and picking what is best by some standards we don't understand... I mean, is bfq really universally better? Is it that clear-cut? If it is, why isn't the kernel changing the defaults anyway?

I mean, I can see why the kernel wants to be a bit more conservative with settings such as fs.protected_hardlinks, because it breaks compat in some cases. We can be more aggressive there in systemd, but the selection of IO schedulers doesn't break compat, it just tweaks behaviour, afaics, so why not leave this to the kernel maintainers to change? Or at least leave it to your specific distro's maintainers to switch? Is there a kernel build-time option to pick the default IO scheduler?

There's also the problem that bfq breaks CPUWeight= currently, i.e. #13335. I am not sure we should merge a patch that trades a tiny bit of improvement against breakage of pretty relevant functionality just like that.

@paolo-github you are the bfq guy, right? Can you comment about this, and on #13335 please?

@kakra
Contributor

kakra commented Aug 19, 2019

> There's also the problem that bfq breaks CPUWeight= currently

@poettering I think this should've read IOWeight=?

@kakra
Contributor

kakra commented Aug 19, 2019

Setting bfq as default via udev rules is problematic because it may not be available. It depends on whether the kernel defaults to using multi-queue or single-queue mode for device drivers, and even then some drivers may only support one or the other. bfq is not officially available for single-queue device drivers. This should really be a distro / maintainer choice unless the kernel dropped support for SQ.

The kernel itself doesn't support a default MQ scheduler, only SQ had a default choice.
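Given that bfq may simply not be offered for a device, a defensive approach would check availability first. A hypothetical guard (illustrative only, not what the PR does — the udev `ATTR{}` assignment just fails silently in that case):

```shell
# Hypothetical guard (illustrative): only switch to bfq if the kernel
# actually offers it for this device, since it may not be built/available.
set_bfq_if_available() {
    local sysfs="$1"   # e.g. /sys/class/block/sda/queue/scheduler
    if grep -qw bfq "$sysfs"; then
        echo bfq > "$sysfs"
    fi
}
```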

@paolo-github

paolo-github commented Aug 19, 2019 via email

@Algodev-github

Algodev-github commented Aug 19, 2019 via email

@kakra
Contributor

kakra commented Aug 19, 2019

Ah okay, I wasn't aware it has been removed in 5.0, some of my machines still run 4.19 stable. Thanks.

@poettering
Member

> There's also the problem that bfq breaks CPUWeight= currently
>
> @poettering I think this should've read IOWeight=?

Yes, of course, sorry for the confusion.

@poettering
Member

> Because they say it's userspace that must do it. In short, they say this must be a per-device choice, and the kernel doesn't do that.

Well, the kernel does this for network queuing disciplines (i.e. some devices are marked by their drivers as needing a fifo scheduler, and then get a different default than others), and I really don't see why this wouldn't work for IO schedulers in a similar fashion too. The networking stack supports picking a default scheduler via the net.core.default_qdisc sysctl, and devices like CAN can then pick others within their driver code. Why is something that works fine for networking people not good enough for block IO people?
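For reference, the networking mechanism mentioned here really is just a single setting. An illustrative sysctl.d fragment (file name and qdisc value are examples, not a recommendation):

```
# /etc/sysctl.d/50-qdisc.conf (illustrative example):
# the network stack lets the default queueing discipline be set once,
# race-free, before any device is brought up.
net.core.default_qdisc = fq_codel
```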

I mean, this sounds like a stupid game of passing around responsibility for this stuff. But instead of implementing the default policy where it's simple and obvious (i.e. the kernel), and only doing the non-default, per-installation tweaks in userspace, kernel folks now can't agree among themselves and just pass the responsibility to userspace wholesale. I mean, we as systemd people have not much clue about elevators; I am not sure we really should be the ones making the decisions here about what a good default is... It's like you come to a city unknown to you, find a taxi that shall bring you to your destination, but then the taxi driver asks you to lead the way, and you've got no map...

I mean, I think doing policy in userspace is great if it actually involves stuff that userspace can do better, or that is per-installation tuning. But for frickin' defaults, that apply everywhere, where all we do is echo some essentially constant nonsense back into the kernel when the kernel asks for it — why? Just why? @axboe any chance you can reconsider this? Why dump such default policy choices on the doorstep of userspace? Why pick defaults that are apparently wrong for most relevant usecases and expect userspace to clean up after you?

I mean, any solution involving udev is pretty ugly in general, because it means we always start the block devices with the wrong scheduler and then swap it out after some initial IO was already done on the device. Who does stuff like that? It's just plain ugly. The solution the networking people picked is nicely race-free, as the defaults are picked beforehand by userspace with a sysctl or by the kernel drivers, not by running userspace after the fact, when it's already too late.

gah, this all sounds like a fragile, hacky, racy garbage that works around a social problem (kernel folks not being able to come to an agreement between themselves) and expectations that userspace is the trash dump for everything the kernel people want to avoid figuring out.

Seriously, this all should just work with a naked kernel, and userspace should not be involved in this for picking defaults.

@poettering
Member

> One of the main practical reasons is that very few, if any, user-space people show up in these discussions (for many good reasons, I know).

Quite frankly, if there's something that an OS kernel should nicely abstract away from userspace, it's an IO scheduler... I really don't want to be included in discussions about which scheduler is better, and certainly don't want to be the one making the decisions on this, as I really have no clue about IO schedulers. I mean, I am pretty sure you don't want my input on IO schedulers; it's not going to be much else than "I don't know". (Also, I am very conservative about taking part in lkml discussions in general, as it tends to result in tons of hate mail flowing my way; hence if you add me on something that is also cc'ed to lkml I tune out right away, because I really don't need that. The toxicity of the kernel community — in particular the fringes of it — in this regard they can keep for themselves.)

@Algodev-github

Algodev-github commented Aug 19, 2019 via email

@@ -0,0 +1,3 @@
# do not edit this file, it will be overwritten on update

ACTION=="add|change", KERNEL=="sd*[!0-9]|sr*", ATTR{queue/scheduler}="bfq"
Member

There should be a comment here, I figure, explaining the situation briefly. (And I'd really clarify that this is a technical solution for a political problem in the kernel community, and that we believe this shouldn't be here.)

I think this should also carry a SUBSYSTEM=="block" check, no?

Also, needs a NEWS entry
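Putting these review comments together, the amended rule file might look something like this (a sketch, not the final committed version — the comment text is only a suggestion):

```
# Do not edit this file; it will be overwritten on update.
#
# Set bfq as the default scheduler. The kernel currently leaves the choice
# of default I/O scheduler to userspace; see
# https://bugzilla.redhat.com/show_bug.cgi?id=1738828 for discussion.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*[!0-9]|sr*", ATTR{queue/scheduler}="bfq"
```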

Member Author

Duh, I moved this to a separate file and forgot to copy all the "headers". So this would need to get fixed.

@axboe

axboe commented Aug 19, 2019

I'll make this brief. I'm not interested in making BFQ the default scheduler as the default scheduler needs to be both simple and 100% stable, and BFQ is neither. There's still too much churn and silly bugs that end up getting introduced and fixed. Maybe that'll change with time as confidence and stability grows, but right now is not that time.

I'm fine with having a way to set a scheduler in the kernel for cases that absolutely need it. That includes devices like SMR/zoned devices, where they are only supported by specific schedulers. Those should be loaded up with a compatible scheduler in the kernel, and not be punted to userspace.

@poettering
Member

poettering commented Aug 19, 2019

@axboe ah, interesting. So you say the prospect is that eventually the take on bfq would change and it could become the default choice in upstream Linux kernels without any userspace interference, after all bugs are fixed and it proved itself?

So what's your opinion on downstream distributions (in particular Fedora) adopting it now? Too early? Or great, to get the testing needed to make it default?

i.e. if we merge this patch now, is this something we have a clear prospect of being able to drop eventually?

(Merging a stopgap that, by being added, helps make itself unnecessary makes this much more attractive to me.)

@axboe

axboe commented Aug 19, 2019

I wouldn't merge this patch now, for the reasons I outlined; it really doesn't matter whether it's systemd or the kernel making the same choice, the result is the same in the end. But that's totally up to you. And yes, it could change over time; I don't have a crystal ball and can't foresee how that will go :-)

As far as distros go, that's 100% up to them as well. I've stated my opinion on the matter, they are free to proceed as they wish as they are the ones handling the support in the end. More users is definitely a win for BFQ and will help to shake out issues and increase confidence in it.

@poettering
Member

@axboe ok, thank you very much for your input.

Having now heard opposing opinions from various folks, I am sure it shouldn't be the systemd folks who decide this, though... It's not clear cut at all; it should primarily be a distro choice, and nothing we push for in upstream systemd.

@Algodev-github

Algodev-github commented Aug 19, 2019 via email

@keszybz
Member Author

keszybz commented Aug 19, 2019

After reading all the pros and cons, I think it is better if we make the decision downstream. In particular, Fedora has a much narrower range of supported kernels (e.g. right now the oldest we have is 5.2.7, in F29), and the decision about the scheduler is strongly influenced by the kernel version. The current version of systemd tries to support kernels >= 3.13.

I'll put this in systemd in F31+.

@keszybz keszybz closed this Aug 19, 2019
@keszybz
Member Author

keszybz commented Aug 19, 2019

... and thank you all for the input. It's very much appreciated.

cgwalters added a commit to cgwalters/rpm-ostree that referenced this pull request Jul 18, 2020
Pairs with: ostreedev/ostree#2152

Be nice to concurrent processes; operating system updates
are usually a background thing.  See e.g.
openshift/machine-config-operator#1897
ostreedev/ostree#2152
This option is most effective in combination with
a block scheduler such as `bfq`, which is the systemd
default since systemd/systemd#13321
