qa/cephfs: lua to respect missing kernel in yaml#66092
Conversation
3f159c9 to
acfb74a
Compare
qa/cephfs/begin/3-kernel.yaml
Outdated
| py_attrgetter(yaml.kernel).pop('koji_task', nil) | ||
| py_attrgetter(yaml.kernel).pop('rpm', nil) | ||
| py_attrgetter(yaml.kernel).pop('sha1', nil) | ||
| py_attrgetter(yaml.kernel).pop('tag', nil) |
There was a problem hiding this comment.
@batrick Wonder if the kernel itself should be popped out from yaml too?
There was a problem hiding this comment.
I believe I didn't do that because it broke the teuthology task, IIRC.
There was a problem hiding this comment.
@batrick Wonder if the
kernelitself should be popped out from yaml too?
Circling back: no. The reason is because the kernel config at the time this postmerge script runs will look something like (without -k):
with k-testing:
kernel:
branch: distro
...
client:
branch: testing
or with k-stock:
kernel:
branch: distro
...
client:
branch: distro
The point of this postmerge script is to avoid installing/updating the kernel on non-client nodes. It took me a while to remember this so please update the comment so it's clear in the future.
|
jenkins test docs |
1 similar comment
|
jenkins test docs |
|
the docs fails because of https://tracker.ceph.com/issues/73645 |
batrick
left a comment
There was a problem hiding this comment.
When teuthology-suite is called with '-k none' option, which is valid, there is no kernel record in job config created.
... why is -k none valid? Should it not be "default" (unset) or some real branch?
qa/cephfs/begin/3-kernel.yaml
Outdated
| py_attrgetter(yaml.kernel).pop('koji_task', nil) | ||
| py_attrgetter(yaml.kernel).pop('rpm', nil) | ||
| py_attrgetter(yaml.kernel).pop('sha1', nil) | ||
| py_attrgetter(yaml.kernel).pop('tag', nil) |
There was a problem hiding this comment.
I believe I didn't do that because it broke the teuthology task, IIRC.
because by default |
That's useful and in fact what this fragment does. |
I meant, the kernel task is useless in that case, but disabling of it is useful of course :-D |
batrick
left a comment
There was a problem hiding this comment.
Please add a comment that the new check is specifically for -k none.
61493cc to
be7f766
Compare
batrick
left a comment
There was a problem hiding this comment.
There are other premerge/postmerge scripts that need to be updated to be tolerant of yaml.kernel == nil.
# f() { teuthology-suite -t main --machine-type smithi --email pdonnell@redhat.com --suite fs -p 75 --force-priority --subset $((RANDOM % 1024))/1024 --ceph wip-pdonnell-testing-20251126.180742-debug --filter-out upgrade,ubuntu --filter k-testing -k none --limit 1 --verbose ; } ; which teuthology-suite && ((sleep 0h && f) || (sleep 1h && f) || (sleep 1h && f) || (sleep 1h && f) || (sleep 90m && f))
...
2025-12-05 18:19:05,050.050 DEBUG:teuthology.suite.merge:skipping config fs/upgrade/mds_upgrade_sequence/{bluestore-bitmap conf/{client mds mgr mon osd} distro fail_fs/yes kernel overrides/{ignorelist_health ignorelist_upgrade ignorelist_wrongly_marked_down pg-warn pg_health syntax} roles tasks/{0-from/squid 1-volume/{0-create 1-ranks/2 2-allow_standby_replay/no 3-inline/yes 4-verify} 2-client/kclient 3-upgrade-mgr-staggered 4-config-upgrade/{fail_fs} 5-upgrade-with-workload 6-verify}} due to postmerge filter
2025-12-05 18:19:05,050.050 DEBUG:teuthology.suite.merge:merging config workload/{0-distro begin/{0-install 1-cephadm 2-logrotate 3-kernel} clusters/1a11s-mds-1c-client-3node conf/{client mds mgr mon osd} mount/kclient/{base/{mount-syntax/{v1} mount overrides/{distro/testing/k-testing ms-die-on-skipped}} ms_mode/legacy wsync/no} objectstore-ec/bluestore-comp-ec-root omap_limit/10 overrides/{cephsqlite-timeout frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts pg_health session_timeout} ranks/multi/{balancer/automatic export-check n/3 replication/default} standby-replay tasks/{0-fscrypt/{with-fscrypt-dummy} 0-subvolume/{with-namespace-isolated} 0-use-snaprealm/{subvol} 1-check-counter 2-scrub/yes 3-snaps/yes 4-flush/no 5-quiesce/no 6-workunit/suites/pjd}}
2025-12-05 18:19:05,052.052 DEBUG:teuthology.suite.merge:premerge script running:
log.debug("base kernel %s", base_config.kernel)
local kernel = base_config.kernel
if kernel.branch ~= "distro" then
log.debug("overriding testing kernel with %s", kernel)
yaml_fragment.kernel.client = kernel
end
2025-12-05 18:19:05,052.052 DEBUG:teuthology.suite.merge:base kernel None
Traceback (most recent call last):
File "/cephfs/home/pdonnell/teuthology/virtualenv/bin/teuthology-suite", line 7, in <module>
sys.exit(main())
File "/cephfs/home/pdonnell/teuthology/scripts/suite.py", line 232, in main
return teuthology.suite.main(args)
File "/cephfs/home/pdonnell/teuthology/teuthology/suite/__init__.py", line 140, in main
run.prepare_and_schedule()
File "/cephfs/home/pdonnell/teuthology/teuthology/suite/run.py", line 467, in prepare_and_schedule
num_jobs = self.schedule_suite()
File "/cephfs/home/pdonnell/teuthology/teuthology/suite/run.py", line 636, in schedule_suite
configs = list(config_merge(configs, **config_merge_kwargs))
File "/cephfs/home/pdonnell/teuthology/teuthology/suite/merge.py", line 157, in config_merge
if not script():
File "lupa/lua54.pyx", line 946, in lupa.lua54._LuaObject.__call__
File "lupa/lua54.pyx", line 1918, in lupa.lua54.call_lua
File "lupa/lua54.pyx", line 1945, in lupa.lua54.execute_lua_call
File "lupa/lua54.pyx", line 1826, in lupa.lua54.raise_lua_error
lupa.lua54.LuaError: [string "teuthology"]:3: attempt to index a nil value (local 'kernel')
stack traceback:
[C]: in ?
^C
It is just a link to the same file, isn't it? |
|
Ah... it |
|
@batrick would the following logic be correct in this case: |
Yes, I think so. Please ensure it can schedule before requesting a new review. |
|
Tried this command: And: |
|
@batrick Can we merge this already or we still need some special testing? |
Needs qa'd by CephFS. cc @vshankar @rishabh-d-dave @mchangir |
I would at least like to see if the test jobs are chosen as expected. |
How can I help? |
I will get to testing this next week, but if you are willing to expedite, then scheduling a fs suite run (even with `--dry-run) and checking if the kclient jobs are picked up would suffice. |
|
Okay, seems like running with Gives another KeyError: Happens because of suite overrides kernel config over here: and similar: I can add another patch: But I wonder do we need to keep kernel.client.sha1, I am not sure that we need to deploy specific kernel when we query |
7bf01d6 to
cc2143d
Compare
|
@vshankar The run scheduled by teuthology crontabjob https://pulpito.ceph.com/teuthology-2025-12-01_20:24:04-fs-main-distro-default-smithi/ has 345 jobs. Same result without providing |
batrick
left a comment
There was a problem hiding this comment.
Okay, seems like running with
-k noneagainst mainfs:thrashGives another KeyError:
File "lupa/lua54.pyx", line 921, in lupa.lua54._LuaObject.__call__ File "lupa/lua54.pyx", line 1886, in lupa.lua54.call_lua File "lupa/lua54.pyx", line 1912, in lupa.lua54.execute_lua_call File "lupa/lua54.pyx", line 391, in lupa.lua54.LuaRuntime.reraise_on_exception File "lupa/lua54.pyx", line 2220, in lupa.lua54.py_object_getindex_with_gil File "lupa/lua54.pyx", line 2182, in lupa.lua54.getitem_for_lua KeyError: 'branch'Happens because of suite overrides kernel config over here: https://github.com/ceph/ceph/blob/main/qa/cephfs/mount/kclient/overrides/distro/stock/k-stock.yaml
kernel: client: sha1: distro ktype: distroand similar: https://github.com/ceph/ceph/blob/main/qa/cephfs/mount/kclient/overrides/distro/testing/k-testing.yaml
I can add another patch:
--- a/qa/cephfs/begin/3-kernel.yaml +++ b/qa/cephfs/begin/3-kernel.yaml @@ -12,7 +12,7 @@ teuthology: - | local kernel = py_attrgetter(yaml).get('kernel') if kernel ~= nil then - local branch = yaml.kernel.branch + local branch = py_attrgetter(yaml.kernel).get('branch') if branch and not yaml.kernel.branch:find "-all$" then log.debug("removing default kernel specification: %s", yaml.kernel) py_attrgetter(yaml.kernel).pop('branch', nil)
I think the right solution is to update the k-stock and k-testing fragments to also set the branch to distro.
But I wonder do we need to keep kernel.client.sha1, I am not sure that we need to deploy specific kernel when we query
-k none, but in this case, we need to checkkernel.branchfor presence when we addkerneltask. @batrick what do you think?
Even if -k none is specified, the k-testing and k-stock yamls should set the kernel for client nodes.
| if kernel.branch ~= "distro" then | ||
| if kernel == nil then | ||
| yaml_fragment.kernel = nil | ||
| elseif kernel.branch ~= "distro" then |
There was a problem hiding this comment.
I do not believe this is correct. This code exists to allow -k wip-kernel-branch to override the default of testing for this YAML fragment. If the scheduler specifies -k none, we still want to install the testing kernel for this fragment.
There was a problem hiding this comment.
Interesting, how is this supposed to work, usually, -k none is provided when there is no corresponding kernel build to avoid kernel task failing. Shouldn't we exclude this test case from the schedule list?
There was a problem hiding this comment.
anyways, I just use this condition:
if kernel ~= nil and kernel.branch ~= "distro" then
.
.
.
There was a problem hiding this comment.
Interesting, how is this supposed to work, usually,
-k noneis provided when there is no corresponding kernel build...
I don't know of a real situation where that would be true.
There was a problem hiding this comment.
This may happen when a suite is scheduled against a build repo built outside of sepia lab.
qa/cephfs/begin/3-kernel.yaml
Outdated
| py_attrgetter(yaml.kernel).pop('koji_task', nil) | ||
| py_attrgetter(yaml.kernel).pop('rpm', nil) | ||
| py_attrgetter(yaml.kernel).pop('sha1', nil) | ||
| py_attrgetter(yaml.kernel).pop('tag', nil) |
There was a problem hiding this comment.
@batrick Wonder if the
kernelitself should be popped out from yaml too?
Circling back: no. The reason is because the kernel config at the time this postmerge script runs will look something like (without -k):
with k-testing:
kernel:
branch: distro
...
client:
branch: testing
or with k-stock:
kernel:
branch: distro
...
client:
branch: distro
The point of this postmerge script is to avoid installing/updating the kernel on non-client nodes. It took me a while to remember this so please update the comment so it's clear in the future.
cc2143d to
2354fbd
Compare
When teuthology-suite is called with '-k none' option, which is valid, there is no kernel record in job config created. However at some test cases the lua premerge dies with exception: KeyError: 'kernel' and when branch is not set for '-k none' and kernel client is overridden: KeyError: 'branch' so teuthology-suite quits unexpectedly without scheduling any jobs. Fixes: https://tracker.ceph.com/issues/73676 Signed-off-by: Kyr Shatskyy <kyrylo.shatskyy@clyso.com>
2354fbd to
b410701
Compare
|
Hey @batrick could you please take another look, thanks. |
|
@vshankar are we good to merge it? I'd love to move forward with backports |
I'm fine with merging this. Please feel free to merge it or I will merge it by EOD. |
When teuthology-suite is called with '-k none' option, which is valid, there is no kernel record in job config created.
However at some test cases the lua premerge dies with exception:
KeyError: 'kernel'
and teuthology-suite quits unexpectedly without scheduling any jobs.
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an
xbetween the brackets:[x]. Spaces and capitalization matter when checking off items this way.Checklist
Show available Jenkins commands
jenkins test classic perfJenkins Job | Jenkins Job Definitionjenkins test crimson perfJenkins Job | Jenkins Job Definitionjenkins test signedJenkins Job | Jenkins Job Definitionjenkins test make checkJenkins Job | Jenkins Job Definitionjenkins test make check arm64Jenkins Job | Jenkins Job Definitionjenkins test submodulesJenkins Job | Jenkins Job Definitionjenkins test dashboardJenkins Job | Jenkins Job Definitionjenkins test dashboard cephadmJenkins Job | Jenkins Job Definitionjenkins test apiJenkins Job | Jenkins Job Definitionjenkins test docsReadTheDocs | Github Workflow Definitionjenkins test ceph-volume allJenkins Jobs | Jenkins Jobs Definitionjenkins test windowsJenkins Job | Jenkins Job Definitionjenkins test rook e2eJenkins Job | Jenkins Job DefinitionYou must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.