
cephadm: error trying to get ceph auth entry for crash daemon #35274

Merged
sebastian-philipp merged 1 commit into ceph:master from jmolmo:issue_45726
Jun 25, 2020

Conversation

@jmolmo
Member

@jmolmo jmolmo commented May 27, 2020

This happens when your cluster has hosts with a `.` in their names, which is the case if your hosts use an FQDN.

It seems that we change the name of the host in the `get_unique_name` function. Maybe I am missing something, but why do we need to change the name of the host?

I have removed the "problematic lines".

Fixes: https://tracker.ceph.com/issues/45726

Details:

I extracted the command we use to get the auth details for the crash daemon and ran it on its own:

[ceph: root@ceph-node-00 /]# ceph auth get client.crash.ceph-node-00
Error ENOENT: failed to find client.crash.ceph-node-00 in keyring

Checking what entries we have with "crash":

[ceph: root@ceph-node-00 /]# ceph auth list | grep crash
installed auth entries:

client.crash.ceph-node-00.cephlab.com
	caps: [mgr] profile crash
	caps: [mon] profile crash
client.crash.ceph-node-01.cephlab.com
	caps: [mgr] profile crash
	caps: [mon] profile crash
client.crash.ceph-node-02.cephlab.com
	caps: [mgr] profile crash
	caps: [mon] profile crash

So it seems that we are searching with the wrong key (`client.crash.ceph-node-00`) instead of the valid key (`client.crash.ceph-node-00.cephlab.com`).
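
The mismatch can also be checked mechanically. The sketch below is illustrative only: it treats the `ceph auth list` output shown above as plain text (the names `SAMPLE` and `crash_entities` are made up for this example) and confirms that the short-name key is absent while the FQDN key is present.

```python
# Output of `ceph auth list | grep crash`, copied from the session above.
SAMPLE = """\
client.crash.ceph-node-00.cephlab.com
\tcaps: [mgr] profile crash
\tcaps: [mon] profile crash
client.crash.ceph-node-01.cephlab.com
\tcaps: [mgr] profile crash
\tcaps: [mon] profile crash
client.crash.ceph-node-02.cephlab.com
\tcaps: [mgr] profile crash
\tcaps: [mon] profile crash
"""


def crash_entities(auth_list_output: str) -> set:
    """Return the crash entity names found in `ceph auth list` text output."""
    return {
        line.strip()
        for line in auth_list_output.splitlines()
        if line.startswith("client.crash.")
    }


entities = crash_entities(SAMPLE)
print("client.crash.ceph-node-00" in entities)              # short name
print("client.crash.ceph-node-00.cephlab.com" in entities)  # FQDN
```

The first check prints `False` and the second prints `True`, which matches the `ENOENT` error seen with the short name.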

Digging into the code shows that we need "ename" in the daemon_id, and the daemon_id is wrong:

"mgr/cephadm/host.ceph-node-00.cephlab.com":
"{\"daemons\": {\"mon.ceph-node-00.cephlab.com\": {\"hostname\": \"ceph-node-00.cephlab.com\",
\"container_id\": \"f1b3200819e7\", \"container_image_id\": \"79566b1db52fa6ab71e3436643a0f40c0d24c19b04eaebe2d322adbaa94c7c9a\", \"container_image_name\": \"docker.io/ceph/daemon-base:latest-master-devel\", 

\"daemon_id\": \"ceph-node-00.cephlab.com\",

 \"daemon_type\": \"mon\", \"version\": \"16.0.0-1574-gacefe44a49\", \"status\": 1, \"status_desc\": \"running\", \"last_refresh\": \"2020-05-21T09:57:48.428738\", \"created\": \"2020-05-21T09:32:45.075442\", \"started\": \"2020-05-21T09:32:48.270246\"}, \"mgr.ceph-node-00.cephlab.com.tqtzkd\": {\"hostname\": \"ceph-node-00.cephlab.com\", \"container_id\": \"2d756ba85004\", \"container_image_id\": \"79566b1db52fa6ab71e3436643a0f40c0d24c19b04eaebe2d322adbaa94c7c9a\", \"container_image_name\": \"docker.io/ceph/daemon-base:latest-master-devel\", \"daemon_id\": \"ceph-node-00.cephlab.com.tqtzkd\", \"daemon_type\": \"mgr\", \"version\": \"16.0.0-1574-gacefe44a49\", \"status\": 1, \"status_desc\": \"running\", \"last_refresh\": \"2020-05-21T09:57:48.428847\", \"created\": \"2020-05-21T09:32:49.420484\", \"started\": \"2020-05-21T09:32:49.501113\"}, \"alertmanager.ceph-node-00\": {\"hostname\": \"ceph-node-00.cephlab.com\", \"container_id\": \"3b96e076fbe9\", \"container_image_id\": \"0881eb8f169f5556a292b4e2c01d683172b12830a62a9225a98a8e206bb734f0\", \"container_image_name\": \"docker.io/prom/alertmanager:latest\", \"daemon_id\": \"ceph-node-00\", \"daemon_type\": \"alertmanager\", \"version\": \"0.20.0\", \"status\": 1, \"status_desc\": \"running\", \"last_refresh\": \"2020-05-21T09:57:48.428899\", \"created\": \"2020-05-21T09:33:58.626047\", \"started\": \"2020-05-21T09:38:05.240100\"}, \"crash.ceph-node-00\": {\"hostname\": \"ceph-node-00.cephlab.com\", \"container_id\": \"56148085cc2d\", \"container_image_id\": \"79566b1db52fa6ab71e3436643a0f40c0d24c19b04eaebe2d322adbaa94c7c9a\", \"container_image_name\": \"docker.io/ceph/daemon-base:latest-master-devel\", \"daemon_id\": \"ceph-node-00\", \"daemon_type\": \"crash\", \"version\": \"16.0.0-1574-gacefe44a49\

And the daemon id is not generated properly **because** the `get_unique_name` function changes the real name of the host (it removes everything from the first dot onward).
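
To make the effect concrete, here is a minimal sketch of the truncation described above. `strip_domain` is a hypothetical stand-in written for this example; it is not the code of `get_unique_name`, only an illustration of cutting a hostname at the first dot and of the auth key that results.

```python
def strip_domain(host: str) -> str:
    """Hypothetical helper: keep only the part of the hostname before the first dot."""
    return host.split(".", 1)[0]


host = "ceph-node-00.cephlab.com"
daemon_id = strip_domain(host)

looked_up = "client.crash." + daemon_id  # key built from the shortened name
stored = "client.crash." + host          # key present in the keyring above

print(looked_up)
print(stored)
print(looked_up == stored)
```

The two keys differ, so an `auth get` on the shortened key cannot find the entry that was created with the full hostname.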

Signed-off-by: Juan Miguel Olmo Martínez <jolmomar@redhat.com>

## Checklist
- [x] References tracker ticket
- [ ] Updates documentation if necessary
- [ ] Includes tests for new functionality or reproducer for bug

@jmolmo jmolmo added the cephadm label May 27, 2020
@jmolmo jmolmo requested a review from a team May 27, 2020 11:41
@sebastian-philipp
Contributor

This would generate a big chaos of existing daemons using the bare name and new daemons using the fqdn.

I think changing this for octopus is very late now.

Do you think we can also fix this by improving

def name_to_auth_entity(name) -> str:
    """
    Map from daemon names to ceph entity names (as seen in config)
    """
    daemon_type = name.split('.', 1)[0]
    if daemon_type in ['rgw', 'rbd-mirror', 'nfs', 'crash', 'iscsi']:
        return 'client.' + name
    elif daemon_type == 'mon':
        return 'mon.'
    elif daemon_type in ['osd', 'mds', 'mgr', 'client']:
        return name
    else:
        raise OrchestratorError("unknown auth entity name")

?

@jmolmo
Member Author

jmolmo commented May 27, 2020

> This would generate a big chaos of existing daemons using the bare name and new daemons using the fqdn.
>
> I think changing this for octopus is very late now.
>
> Do you think we can also fix this by improving
>
> def name_to_auth_entity(name) -> str:
>     """
>     Map from daemon names to ceph entity names (as seen in config)
>     """
>     daemon_type = name.split('.', 1)[0]
>     if daemon_type in ['rgw', 'rbd-mirror', 'nfs', 'crash', 'iscsi']:
>         return 'client.' + name
>     elif daemon_type == 'mon':
>         return 'mon.'
>     elif daemon_type in ['osd', 'mds', 'mgr', 'client']:
>         return name
>     else:
>         raise OrchestratorError("unknown auth entity name")
>
> ?

I think that not using the real name of the host is going to cause more problems in the future. Besides that, it forces us to reach a "consensus" about when to use the real name and when to use the abbreviated name, which is another source of problems.
We can try to "patch" it, but in my view this kind of thing is like a small snowball rolling downhill.

@sebastian-philipp
Contributor

don't know. Ceph already prefers bare host names. Imagine if you simply want to change the domain of a cluster. That would be extremely complicated, if we add this to the daemon names.

@sebastian-philipp
Contributor

I have the feeling that doing this radical change of creating new daemons with a different naming scheme requires some more thought. Might be something to do for pacific?

@jmolmo
Member Author

jmolmo commented Jun 1, 2020

> I have the feeling that doing this radical change of creating new daemons with a different naming scheme requires some more thought. Might be something to do for pacific?

I am now avoiding the error with a not-so-wonderful trick (in Spanish we call this kind of thing a "ñapa").
And yes, we can probably leave the naming change for Pacific.

@jmolmo jmolmo mentioned this pull request Jun 2, 2020
3 tasks
Comment on lines +1649 to +1651
partial_auth_key = daemon_id
if daemon_type == 'crash': #
    partial_auth_key = host
Contributor

can you move this code into utils.name_to_auth_entity?

Member Author

Done.
Note: It seems that mgr also changed its auth key.

Contributor

thanks!

@jmolmo
Member Author

jmolmo commented Jun 4, 2020

Solved a couple of tricky collateral "mypy" issues.

@sebastian-philipp sebastian-philipp added the wip-swagner-testing My Teuthology tests label Jun 5, 2020
@sebastian-philipp
Contributor

http://pulpito.ceph.com/swagner-2020-06-05_13:12:52-rados:cephadm-wip-swagner-testing-2020-06-05-1139-distro-basic-smithi/5119560/

2020-06-05T13:42:59.767 INFO:teuthology.orchestra.run.smithi117.stderr:Error EINVAL: Traceback (most recent call last):
2020-06-05T13:42:59.767 INFO:teuthology.orchestra.run.smithi117.stderr:  File "/usr/share/ceph/mgr/mgr_module.py", line 1171, in _handle_command
2020-06-05T13:42:59.767 INFO:teuthology.orchestra.run.smithi117.stderr:    return self.handle_command(inbuf, cmd)
2020-06-05T13:42:59.768 INFO:teuthology.orchestra.run.smithi117.stderr:  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 113, in handle_command
2020-06-05T13:42:59.768 INFO:teuthology.orchestra.run.smithi117.stderr:    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
2020-06-05T13:42:59.768 INFO:teuthology.orchestra.run.smithi117.stderr:  File "/usr/share/ceph/mgr/mgr_module.py", line 311, in call
2020-06-05T13:42:59.768 INFO:teuthology.orchestra.run.smithi117.stderr:    return self.func(mgr, **kwargs)
2020-06-05T13:42:59.768 INFO:teuthology.orchestra.run.smithi117.stderr:  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 75, in <lambda>
2020-06-05T13:42:59.769 INFO:teuthology.orchestra.run.smithi117.stderr:    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
2020-06-05T13:42:59.769 INFO:teuthology.orchestra.run.smithi117.stderr:  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 66, in wrapper
2020-06-05T13:42:59.769 INFO:teuthology.orchestra.run.smithi117.stderr:    return func(*args, **kwargs)
2020-06-05T13:42:59.769 INFO:teuthology.orchestra.run.smithi117.stderr:  File "/usr/share/ceph/mgr/orchestrator/module.py", line 715, in _daemon_add_osd
2020-06-05T13:42:59.769 INFO:teuthology.orchestra.run.smithi117.stderr:    raise_if_exception(completion)
2020-06-05T13:42:59.769 INFO:teuthology.orchestra.run.smithi117.stderr:  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 633, in raise_if_exception
2020-06-05T13:42:59.770 INFO:teuthology.orchestra.run.smithi117.stderr:    raise e
2020-06-05T13:42:59.770 INFO:teuthology.orchestra.run.smithi117.stderr:mgr_module.MonCommandFailed: auth get failed: invalid entity_auth 0 retval: -22
2020-06-05T13:42:59.770 INFO:teuthology.orchestra.run.smithi117.stderr:

2020-06-05T13:42:59.770 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: File "/usr/share/ceph/mgr/mgr_module.py", line 1096, in check_mon_command
2020-06-05T13:42:59.771 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: raise MonCommandFailed(f'{cmd_dict["prefix"]} failed: {r.stderr} retval: {r.retval}')
2020-06-05T13:42:59.771 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: mgr_module.MonCommandFailed: auth get failed: invalid entity_auth 0 retval: -22
2020-06-05T13:42:59.771 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: debug 2020-06-05T13:42:59.757+0000 7f5dad2b1700 -1 mgr handle_command module 'orchestrator' command handler threw exception: auth get failed: invalid entity_auth 0 retval: -22
2020-06-05T13:42:59.772 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: debug 2020-06-05T13:42:59.761+0000 7f5dad2b1700 -1 mgr.server reply reply (22) Invalid argument Traceback (most recent call last):
2020-06-05T13:42:59.772 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: File "/usr/share/ceph/mgr/mgr_module.py", line 1171, in _handle_command
2020-06-05T13:42:59.773 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: return self.handle_command(inbuf, cmd)
2020-06-05T13:42:59.773 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 113, in handle_command
2020-06-05T13:42:59.773 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: return dispatch[cmd['prefix']].call(self, cmd, inbuf)
2020-06-05T13:42:59.773 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: File "/usr/share/ceph/mgr/mgr_module.py", line 311, in call
2020-06-05T13:42:59.774 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: return self.func(mgr, **kwargs)
2020-06-05T13:42:59.774 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 75, in <lambda>
2020-06-05T13:42:59.774 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
2020-06-05T13:42:59.774 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 66, in wrapper
2020-06-05T13:42:59.775 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: return func(*args, **kwargs)
2020-06-05T13:42:59.775 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: File "/usr/share/ceph/mgr/orchestrator/module.py", line 715, in _daemon_add_osd
2020-06-05T13:42:59.775 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: raise_if_exception(completion)
2020-06-05T13:42:59.776 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 633, in raise_if_exception
2020-06-05T13:42:59.776 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: raise e
2020-06-05T13:42:59.776 INFO:ceph.mgr.x.smithi006.stdout:Jun 05 13:42:59 smithi006 bash[7715]: mgr_module.MonCommandFailed: auth get failed: invalid entity_auth 0 retval: -22
2020-06-05T13:43:00.185 DEBUG:teuthology.orchestra.run:got remote process result: 22
2020-06-05T13:43:00.186 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 32, in nested
    vars.append(enter())
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-swagner-testing-2020-06-05-1139/qa/tasks/cephadm.py", line 618, in ceph_osds
    remote.shortname + ':' + short_dev
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-swagner-testing-2020-06-05-1139/qa/tasks/cephadm.py", line 47, in _shell
    **kwargs
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 206, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 475, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed on smithi117 with status 22: 'sudo /home/ubuntu/cephtest/cephadm --image quay.io/ceph-ci/ceph:ef66e1bc4d611e10aee43b698f822996673b3fe4 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid fcbd512a-a731-11ea-a06b-001a4aab830c -- ceph orch daemon add osd smithi117:vg_nvme/lv_4'
2020-06-05T13:43:00.187 INFO:teuthology.orchestra.run.smithi006:> sudo rm -f /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring
2020-06-05T13:43:00.345 INFO:teuthology.orchestra.run.smithi117:> sudo rm -f /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring
2020-06-05T13:43:00.511 INFO:tasks.cephadm:Cleaning up testdir ceph.* files...
2020-06-05T13:43:00.511 INFO:teuthology.orchestra.run.smithi006:> rm -f /home/ubuntu/cephtest/seed.ceph.conf /home/ubuntu/cephtest/ceph.pub
2020-06-05T13:43:00.606 INFO:teuthology.orchestra.run.smithi117:> rm -f /home/ubuntu/cephtest/seed.ceph.conf /home/ubuntu/cephtest/ceph.pub
2020-06-05T13:43:00.698 INFO:tasks.cephadm:Stopping all daemons...
2020-06-05T13:43:00.699 INFO:tasks.cephadm.mon.a:Stopping mon.a...
2020-06-05T13:43:00.699 INFO:teuthology.orchestra.run.smithi117:> sudo systemctl stop ceph-fcbd512a-a731-11ea-a06b-001a4aab830c@mon.a
2020-06-05T13:43:00.717 INFO:ceph.mon.b.smithi006.stdout:Jun 05 13:43:00 smithi006 bash[6357]: audit 2020-06-05T13:42:59.751952+0000 mon.b (mon.2) 30 : audit [INF] from='mgr.34109 172.21.15.6:0/2239680911' entity='mgr.x' cmd=[{"prefix": "auth get", "entity": "0"}]: dispatch
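
The last audit line records the exact `auth get` request that reached the monitor. As a small illustrative sketch (the helper name `requested_entity` is invented here), the payload can be pulled out of the line and decoded to see which entity was asked for:

```python
import json
import re

# Audit line copied from the mon log above (shortened to the relevant part).
AUDIT_LINE = (
    "audit [INF] from='mgr.34109 172.21.15.6:0/2239680911' entity='mgr.x' "
    'cmd=[{"prefix": "auth get", "entity": "0"}]: dispatch'
)


def requested_entity(line: str) -> str:
    """Extract the `entity` argument of the mon command recorded in an audit line."""
    match = re.search(r"cmd=(\[.*\]): dispatch", line)
    if match is None:
        raise ValueError("no cmd=[...] payload found")
    commands = json.loads(match.group(1))
    return commands[0]["entity"]


print(requested_entity(AUDIT_LINE))
```

This prints a bare `0`, with no daemon-type prefix, which is consistent with the `invalid entity_auth 0` message in the traceback.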

@sebastian-philipp
Contributor

ping @jmolmo

Comment on lines +890 to +899
host, # type str
entity, # type str
command, # type str
args, # type List[str]
addr = "", # type Optional[str]
stdin = "", # type Optional[str]
no_fsid = False, # type Optional[bool]
error_ok = False, # type Optional[bool]
image = False, # type Optional[str]
env_vars= None # type Optional[List[str]]
Contributor

why change this to the old style type annotation?

Member Author

Because I didn't know which annotation style is preferred. From your comment I assume it is the "new" style, so I will change it.
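
For readers unfamiliar with the two styles discussed here, the sketch below contrasts them on a made-up function (`run_old_style` and `run_new_style` are hypothetical and are not the signature of `_run_cephadm`). Both forms are described in PEP 484; only the inline form is recorded in `__annotations__` at runtime.

```python
from typing import List, Optional


def run_old_style(host, args, stdin=None):
    # type: (str, List[str], Optional[str]) -> str
    """Comment-style hints: read by type checkers, invisible at runtime."""
    return " ".join([host] + args)


def run_new_style(host: str, args: List[str], stdin: Optional[str] = None) -> str:
    """Inline annotations: the same hints, also stored in __annotations__."""
    return " ".join([host] + args)


print(run_old_style.__annotations__)
print(sorted(run_new_style.__annotations__))
```

Note that a type comment needs the colon after `type` to be recognized by a checker; without it the text is an ordinary comment.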

Comment on lines +19 to 39
+def name_to_auth_entity(daemon_type, # type: str
+                        daemon_id, # type: str
+                        host = "" # type Optional[str] = ""
+                        ):
     """
-    Map from daemon names to ceph entity names (as seen in config)
+    Map from daemon names/host to ceph entity names (as seen in config)
     """
-    daemon_type = name.split('.', 1)[0]
-    if daemon_type in ['rgw', 'rbd-mirror', 'nfs', 'crash', 'iscsi']:
-        return 'client.' + name
+    if daemon_type in ['rgw', 'rbd-mirror', 'nfs', "iscsi"]:
+        return 'client.' + daemon_type + "." + daemon_id
+    elif daemon_type == 'crash':
+        return 'client.' + daemon_type + "." + host
     elif daemon_type == 'mon':
         return 'mon.'
-    elif daemon_type in ['osd', 'mds', 'mgr', 'client']:
-        return name
+    elif daemon_type == 'mgr':
+        return daemon_type + "." + daemon_id
+    elif daemon_type in ['osd', 'mds', 'client']:
+        return daemon_type + "." + daemon_id
     else:
         raise OrchestratorError("unknown auth entity name")
Contributor

you want to add a pytest for this?

Member Author

sure
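
A table-driven test for the mapping could look roughly like the sketch below. To keep it self-contained, the new function is re-stated from the diff under review (with the two branches that return the same value merged) and `OrchestratorError` is replaced by a local stand-in class. The test cases are illustrative assumptions and are not the tests that were added to this PR.

```python
class OrchestratorError(Exception):
    """Local stand-in for the orchestrator's exception type (assumption)."""


def name_to_auth_entity(daemon_type: str, daemon_id: str, host: str = "") -> str:
    """Map from daemon names/host to ceph entity names (re-stated from the diff)."""
    if daemon_type in ["rgw", "rbd-mirror", "nfs", "iscsi"]:
        return "client." + daemon_type + "." + daemon_id
    elif daemon_type == "crash":
        # The crash daemon is keyed by the full hostname, not the daemon id.
        return "client." + daemon_type + "." + host
    elif daemon_type == "mon":
        return "mon."
    elif daemon_type in ["osd", "mds", "mgr", "client"]:
        return daemon_type + "." + daemon_id
    else:
        raise OrchestratorError("unknown auth entity name")


# (daemon_type, daemon_id, host, expected entity); values are made up.
CASES = [
    ("crash", "ceph-node-00", "ceph-node-00.cephlab.com",
     "client.crash.ceph-node-00.cephlab.com"),
    ("rgw", "foo", "", "client.rgw.foo"),
    ("mon", "a", "", "mon."),
    ("osd", "0", "", "osd.0"),
    ("mgr", "x", "", "mgr.x"),
]


def test_name_to_auth_entity() -> None:
    for daemon_type, daemon_id, host, expected in CASES:
        assert name_to_auth_entity(daemon_type, daemon_id, host) == expected
    try:
        name_to_auth_entity("unknown", "x")
    except OrchestratorError:
        pass
    else:
        raise AssertionError("expected OrchestratorError")


test_name_to_auth_entity()
```

Under pytest, the loop could be replaced by `pytest.mark.parametrize` over `CASES` and the error case by `pytest.raises`.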

If your cluster has nodes with a . in the name. This will happen.

Signed-off-by: Juan Miguel Olmo Martínez <jolmomar@redhat.com>
         raise OrchestratorError('no hosts defined')
     out, err, code = self._run_cephadm(
-        host, None, 'pull', [],
+        host, '', 'pull', [],
Contributor

@sebastian-philipp sebastian-philipp Jun 23, 2020

unfortunately, the upgrade test totally failed: http://pulpito.ceph.com/swagner-2020-06-23_11:55:14-rados:cephadm-wip-swagner-testing-2020-06-23-1057-distro-basic-smithi/5172323/

I'd like to verify this PR in a new run.

@sebastian-philipp sebastian-philipp added wip-swagner-testing My Teuthology tests and removed wip-swagner-testing My Teuthology tests labels Jun 23, 2020
@sebastian-philipp
Contributor

jenkins test make check
